Back when AI was a theoretical proposition and a curiosity among a small circle of researchers, the wider world scarcely recognized its potential. But the furor around generative AI has shone a light on its ability to drive tangible value and change the way the world works.
These technologies have the power to automate, augment, and reinvent nearly every aspect of daily life and business. While many believe generative AI makes work easier, some worry about job displacement. However, the technology may not be ready to replace humans just yet.
Testing generative AI’s infallibility
The full potential of value creation with generative AI has yet to be realized, and its infallibility is still a work in progress. In an NPR interview, Wharton Professor Ethan Mollick suggested thinking of it as an “eager-to-please intern who sometimes lies to you,” at times with complete confidence.
As users around the globe grow more dependent on AI, the technology’s errors become harder to recognize and pinpoint. Mistakes, omissions, and biases in outputs are often difficult to detect, making it harder to ensure AI-generated content is accurate and reliable. But knowing what we know about human behavior, can we help people before they accept generative AI outputs as completely trustworthy?
A recent field experiment conducted by Accenture and MIT addressed the issue of errors and inaccuracies in generative AI.
Studying the role of behavioral science in technology isn’t novel, especially when the question is how human behavior and cognitive biases shape the way we use new tools. The concern arises when we accept information at face value, particularly when it’s generated by a seemingly infallible AI system. That credulity can lead to over-reliance on AI outputs, increasing the risk of perpetuating errors and misinformation.
Inside the experiment
The field experiment included 140 Accenture research professionals and a tool aimed at nudging users to recognize errors by introducing friction, or “speedbumps,” in the AI-generated output. By connecting behavioral science with generative AI, it encouraged users to engage in “System 2” thinking—they were pushed to think more deliberately and analytically rather than intuitively.
Participants were asked to complete and submit two executive summaries within 70 hours using ChatGPT outputs. They were given AI-generated outputs with varying levels of highlighted text that indicated either correctness, potential errors, or omissions. The highlighting was part of a hypothetical tool designed to enhance error detection. Participants were divided into three groups: a group offered full friction (all types of highlighting), one offered medium friction (error and omission highlighting), and one offered no friction (no highlighting).
The findings revealed that the full-friction condition, with its intensive highlighting, improved error and omission detection but increased task completion time. The medium-friction condition seemed to strike the right balance between accuracy and efficiency.
Friction can be good
While the experiment itself “nudged” users to slow down and scrutinize potential errors in AI outputs, the implications of these findings extend beyond the immediate context. Consciously adding friction to the AI output process can help companies experiment with AI responsibly and enhance the reliability and transparency of AI-generated content. Promoting more thorough review is crucial in areas such as healthcare, finance, and legal services, where accuracy is paramount.
By keeping humans in the loop and fostering more deliberative ways of working, companies can scale generative AI tools across their value chain while minimizing inaccuracies and errors. Friction, or “speedbumps,” prompts users to pause and think rather than anchor on the first output they are given.
One surprising observation from the experiment is that participants in all three friction conditions responded the same way to the follow-up survey statement: “I am more aware of the types of errors to look for when using gen AI.” Even those who received no highlighting felt just as equipped to spot mistakes; in other words, people continued to overestimate their ability to identify errors in AI-generated content.
This signals that nudging, or similar forms of quality checks, must continually be tested and incorporated into gen AI deployments so that users don’t reflexively accept gen AI content as accurate—at least not until the technology reaches a more mature stage.
Set up speed bumps, not speed barriers
While introducing friction helps humans stay engaged in evaluating content, companies still need to proceed with caution. Interventions should steer our behavior toward better decisions without restricting choice, consuming too much time, or becoming so burdensome that they erode the advantages generative AI can provide.
What’s most evident from this field test is the importance of continued experimentation as companies scale AI adoption responsibly and encourage users to think more deliberatively. By fostering a culture of experimentation and critical thinking, organizations can mitigate the risks associated with AI errors and biases. In such an environment, users become more adept at recognizing potential pitfalls and better equipped to make informed decisions. This approach not only improves the quality of AI outputs but also builds AI literacy among users.
Ultimately, generative AI has become ubiquitous in many facets of our lives, and efforts to make it more reliable and accurate are essential. This study makes it clear that keeping humans in the loop and experimenting during deployment can take the scaling of AI to the next level and nudge stakeholders across the value chain to partner in building responsible AI.
The opinions expressed in Fortune.com commentary pieces are solely the views of their authors and do not necessarily reflect the opinions and beliefs of Fortune.