Researchers and engineers using OpenAI’s Whisper audio transcription tool report that it often hallucinates, producing chunks of text that don’t reflect anything in the original recording. According to the Associated Press, a University of Michigan researcher found made-up text in 80% of the Whisper transcriptions he inspected, which prompted him to try to improve the model.
AI hallucination isn’t a new phenomenon, and researchers have been trying to fix this using different tools like semantic entropy. However, what’s troubling is that the Whisper AI audio transcription tool is widely used in medical settings, where mistakes could have deadly consequences.
For example, one speaker said, “He, the boy, was going to, I’m not sure exactly, take the umbrella,” but Whisper transcribed, “He too a big piece of a cross, a teeny, small piece … I’m sure he didn’t have a terror knife so he killed a number of people.” In another recording, the speaker said, “two other girls and one lady,” which Whisper rendered as “two other girls and one lady, um, which were Black.” Lastly, in one medical example, Whisper wrote “hyperactivated antibiotics” into its output, a class of drugs that does not exist.
Nabla, an AI assistant used by over 45,000 clinicians
Despite the above news, Nabla, an ambient AI assistant that helps clinicians transcribe patient-doctor interactions and create notes or reports after the visit, still uses Whisper. The company claims that over 45,000 clinicians across 85+ health organizations use the tool, including Children’s Hospital Los Angeles and Mankato Clinic in Minnesota.
Even though Nabla is based on OpenAI’s Whisper, the company’s Chief Technology Officer, Martin Raison, says that its tool is fine-tuned on medical language to transcribe and summarize interactions. However, OpenAI recommends against using Whisper for crucial transcriptions, even warning against using it in “decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes.”
The company behind Nabla says that it’s aware of Whisper’s tendency to hallucinate and that it’s already addressing the problem. However, Raison also said that Nabla cannot compare the AI-generated transcript with the original audio recording, as its tool automatically deletes the original audio for data privacy and safety. Fortunately, no complaint has yet been recorded against a medical provider over hallucinations in AI note-taking tools.
Even so, William Saunders, a former OpenAI engineer, said that removing the original recording could be problematic, as the healthcare provider would have no way to verify whether the text is correct. “You can’t catch errors if you take away the ground truth,” he told the Associated Press.
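Saunders’s point can be made concrete with a standard transcription metric. Word error rate (WER) measures how far a hypothesis transcript drifts from a reference transcript, but it requires that reference, which is exactly what deleting the audio removes. The sketch below is purely illustrative (it is not Nabla’s or OpenAI’s code) and uses one of the AP’s reported examples:

```python
# Illustrative sketch: word error rate (WER) between a reference ("ground
# truth") transcript and an AI-generated one. Without the original audio,
# no reference transcript can be produced, so this check is impossible.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

reference = "two other girls and one lady"
hypothesis = "two other girls and one lady um which were Black"
print(round(word_error_rate(reference, hypothesis), 2))  # 4 inserted words / 6 reference words
```

The four hallucinated words produce a WER of 0.67, a score a reviewer could flag automatically, but only if the ground-truth side of the comparison still exists.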
Nevertheless, Nabla requires its users to edit and approve transcribed notes. If the tool delivers its report while the patient is still in the room, the clinician has a chance to check the results against recent memory, and even to confirm details with the patient if anything in the AI transcription seems inaccurate.
This shows that AI isn’t an infallible machine that gets everything right. Instead, we can think of it as someone who thinks quickly but whose output needs to be double-checked every time. AI is certainly a useful tool in many situations, but we can’t let it do the thinking for us, at least for now.