Hello and welcome to Eye on AI.
Name a profession and there’s almost certainly someone building a generative AI copilot for it. Accountants, lawyers, doctors, architects, financial advisors, marketing copywriters, software programmers, cybersecurity experts, salespeople—there are copilots already in the market for all of these roles.
AI copilots differ from general-purpose LLM-based chatbots, such as those built on OpenAI’s GPT models, although many copilots use one of those general-purpose models as their central component. Copilots have user interfaces, and usually backend processes, specifically tailored to the tasks someone in that profession would want assistance with—whether that is crafting an Excel spreadsheet formula for an accountant or, for a salesperson, figuring out the best wording to convince a customer to close a complex deal. Many copilots rely on a process called RAG—retrieval augmented generation—in which the system first retrieves relevant passages from a trusted source, such as a legal database, and feeds them to the model alongside the user’s question. The goal is to boost the accuracy of the information the copilot outputs and reduce the tendency of LLMs to hallucinate, that is, to produce superficially plausible but inaccurate information.
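To make the pattern concrete, here is a deliberately toy sketch of RAG in Python. The document list, keyword-overlap retrieval, and generate() stub are hypothetical placeholders of my own (a real copilot would use vector search over a large professional database and a production LLM), but the shape is the same: retrieve sources first, then instruct the model to answer only from them.

```python
# A toy illustration of the RAG pattern: retrieve relevant passages first,
# then ground the model's prompt in them. The document store, scoring
# function, and generate() stub are hypothetical placeholders, not any
# vendor's actual implementation.
from collections import Counter

# A hypothetical mini-library of source passages a copilot might index.
DOCUMENTS = [
    "Case A v. B (2019) held that the limitations period is three years.",
    "Case C v. D (2021) was overturned on appeal in 2023.",
    "The SUMIF spreadsheet function adds cells that meet a single condition.",
]

def score(query: str, doc: str) -> int:
    """Crude keyword-overlap score, standing in for a real vector search."""
    return sum((Counter(query.lower().split()) & Counter(doc.lower().split())).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages most relevant to the query."""
    return sorted(DOCUMENTS, key=lambda d: score(query, d), reverse=True)[:k]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; a real copilot would send the prompt to a model."""
    return f"[model answer grounded in a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    # Retrieved passages are injected into the prompt so the model answers
    # from cited sources rather than from memory alone.
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer using ONLY the sources below and cite them. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)

if __name__ == "__main__":
    print(answer("Was Case C v. D overturned?"))
```

As the study discussed below shows, grounding the model this way reduces hallucinations but does not eliminate them: the retrieval step can surface the wrong passages, and the model can still misread the ones it is given.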
Perhaps no profession other than software development has embraced experimentation with copilots as enthusiastically as the law. There have already been several instances where lawyers—including former Trump lawyer-turned-star-witness-for-the-prosecution Michael Cohen—have been reprimanded and fined by judges for naively (or very lazily) using ChatGPT for legal research and writing without checking the case citations it produced, which in some cases turned out to be completely invented. The legal copilots, however, are supposed to be much better than ChatGPT at completing legal tasks and answering legal questions.
But are they? The answer matters because lawyers’ experience with these copilots may foretell what will happen in other professions over the next few years. In that context, a study published last month (and updated Friday) by researchers affiliated with Stanford University’s Human-Centered AI Institute (HAI) sounded an important caution—not just for the legal profession but for copilots as a whole.
The HAI researchers, who included a Stanford Law professor, created a dataset of 200 questions designed to mimic the kinds of questions a lawyer might ask a legal research copilot. The Stanford team claims its questions are a better test of how legal copilots may perform in a real-world setting than bar exam questions—especially because many bar exam question sets have already been memorized by LLMs, which are trained on vast amounts of data scraped from the internet. The dataset includes some particularly tricky questions built on a false premise. Such questions often lead LLMs astray: trained to be helpful and agreeable, they frequently accept the false premise and then invent information to justify it, rather than telling the user the premise of the question is wrong.
The researchers then tested several prominent legal research copilots, including one from LexisNexis (Lexis+ AI) and two from Thomson Reuters (Ask Practical Law AI and Westlaw’s AI-Assisted Research), on this dataset. They used OpenAI’s GPT-4 as a kind of control, to see how well an LLM would do without RAG and without any of the other backend processing geared specifically toward legal research. The answers were evaluated by human experts.
For lawyers—and everyone hopeful RAG would eliminate hallucinations—there was a little bit of good news and quite a lot of not-so-good news in the results. The good news is that RAG did indeed reduce hallucination rates significantly: GPT-4 hallucinated in 43% of its answers, while the worst of the three legal copilots did so in 33%. The bad news is that the hallucination rates were still much higher than you’d want. The two best copilots still made up information in roughly one out of six instances. Worse still, the RAG-based legal copilots often omitted key information from their answers, with between nearly a fifth and well over half of their responses, depending on the copilot, judged incomplete by the human evaluators. By contrast, fewer than one in 10 of GPT-4’s responses failed on this metric. The study also noted that LexisNexis’s copilot supplied legal citations for all the information it provided, but that the cited cases sometimes did not say what the copilot claimed they did. The researchers warned that this kind of error can be particularly dangerous because a citation to a real case can make lawyers complacent, letting mistakes slip past.
LexisNexis and Thomson Reuters have both said that the accuracy figures in the HAI study were significantly lower than what they’ve found in their own internal performance testing and in feedback from customers. “Our thorough internal testing of AI-Assisted Research shows an accuracy rate of approximately 90% based on how our customers use it, and we’ve been very clear with customers that the product can produce inaccuracies,” Mike Dahn, head of Westlaw Product Management at Thomson Reuters, wrote in a blog response to the HAI study.
“LexisNexis has extensive programs and system measures in place to improve the accuracy of responses over time, including the validation of citing authority references to mitigate hallucination risk in our product,” Jeff Pfeifer, LexisNexis chief product officer for the U.S., Canada, Ireland, and the U.K., wrote in a statement provided to newsletter LegalDive.
The blog post HAI wrote to accompany the research pointed to a recent story by Bloomberg Law that also could give people pause. It looked at the experience of Paul Weiss Rifkind Wharton & Garrison, one of the 50 largest U.S. law firms with close to 1,000 attorneys, in using a legal copilot from the startup Harvey. Paul Weiss told the news organization that it wasn’t using quantitative metrics to assess the copilot because, according to Bloomberg, “the importance of reviewing and verifying the accuracy of the output, including checking the AI’s answers against other sources, makes any efficiency gains difficult to measure.” The copilot’s answers could also be inconsistent—with the same query yielding different results at different times—or extremely sensitive to seemingly inconsequential changes in the wording of a prompt. As a result, Paul Weiss said it wasn’t yet in a position to determine the return on investment from using Harvey.
Instead, Paul Weiss was evaluating the copilot on qualitative measures, such as how much attorneys enjoyed using it. And here, there were some interesting anecdotes. It turned out that while junior lawyers might not save much time using the AI copilot for research, because of the need to verify its answers, more senior lawyers found it a very useful tool for brainstorming possible legal arguments. The firm also noted that the copilot could do certain things—such as evaluate every single contract in a huge database in minutes—that humans simply could not do. In the past, firms had to rely on some sort of statistical sampling of the contracts, and even then the process might take days or weeks.
Pablo Arredondo, cofounder of CoCounsel, a legal copilot now owned by Thomson Reuters but not included in the HAI study, told me that the HAI study and the Bloomberg story reinforce the point that all generative AI legal copilots need oversight (as do junior associates at law firms). Some of the areas where the copilots stumbled in the HAI study, such as determining whether a case had subsequently been overturned by a higher court, are also areas where different legal research companies often provide conflicting information, he noted.
Taken together, I think the Stanford study and the Bloomberg Law story say a lot about where AI copilots are today and how we should think about where they are heading. Some AI researchers and skeptics of the current hype around generative AI have seized on the HAI paper as evidence that LLMs are entering the “trough of disillusionment” and that perhaps the entire field is about to enter another “AI winter.” I don’t think that’s quite right. Yes, the Stanford paper points to serious weaknesses in AI copilots. And yes, RAG will not cure hallucinations. But I think we will find ways to keep reducing hallucinations (longer context windows are one of them), and people will continue to use copilots.
The HAI paper makes a great case for rigorous testing—and for sharing that performance data with users. Professionals need a clear sense of copilots’ capabilities and weaknesses, and they need to understand how the systems are likely to fail. Having this mental model of how a particular copilot works is essential for any professional working alongside one. And, as the Bloomberg Law story suggests, many professionals will come to find copilots useful and helpful even when they aren’t entirely accurate, and even if the efficiency gains from such a system are hard to measure. It’s not about whether the copilot can do well enough on its own to replace human workers. It’s about whether the human working with the copilot can perform better than they could on their own—just as with the senior Paul Weiss lawyers who said it helped them think through legal arguments.
Arredondo said that Thomson Reuters is in early discussions with Stanford about forming a consortium of legal tech firms and law firms that would partner with other academic institutions to develop and maintain benchmarks for legal copilots. Ideally, he said, those benchmarks would measure how human lawyers perform on the same tests both on their own and when assisted by AI tools, rather than evaluating the systems only against one another, without the human oversight they still need.
We don’t have very good benchmarks for human-AI teaming. It’s time to create some.
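To show what that kind of benchmark could look like in its simplest form, here is a hypothetical sketch in Python. The condition names and the scores are invented placeholders purely to illustrate the structure of the comparison: the same question set graded for a lawyer working alone, the copilot working alone, and the lawyer working with the copilot.

```python
# Hypothetical sketch of a human-AI teaming benchmark: grade the same
# question set answered three ways and compare the averages. The scores
# below are invented placeholders to show the structure; a real benchmark
# would use expert-graded answers, not made-up numbers.
from statistics import mean

graded_answers = {
    "lawyer_alone":        [1.0, 0.5, 1.0, 0.0, 1.0],
    "copilot_alone":       [1.0, 0.0, 0.5, 1.0, 0.5],
    "lawyer_with_copilot": [1.0, 1.0, 1.0, 0.5, 1.0],
}

for condition, scores in graded_answers.items():
    print(f"{condition:>20}: mean score {mean(scores):.2f}")

# The comparison that matters is lawyer_with_copilot vs. lawyer_alone,
# not copilot_alone vs. a leaderboard of other copilots.
```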
There’s more AI news below... But first, if you want to find out more about working alongside AI copilots, I’ve got some news of my own: My book Mastering AI: A Survival Guide to Our Superpowered Future is now available for pre-order in the U.S. and the U.K.! The book has a chapter on how AI will transform the way we work. But Mastering AI goes well beyond that to reveal how AI will change and challenge our democracy, our society, and even ourselves. AI presents tremendous opportunities in science, education, and business, but we must urgently address the substantial risks this technology poses. In Mastering AI I explain how. If you enjoy this newsletter, I know you’ll find the book valuable. Please consider pre-ordering your copy today.
Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn
Correction, June 4: An earlier version of this story misspelled the full name of the law firm Paul Weiss Rifkind Wharton & Garrison.