Artificial intelligence is worse than humans in every way at summarising documents and might actually create additional work for people, a government trial of the technology has found.
Amazon conducted the test earlier this year for Australia’s corporate regulator, the Australian Securities and Investments Commission (ASIC), using submissions made to an inquiry. The outcome of the trial was revealed in an answer to a question on notice at the Senate select committee on adopting artificial intelligence.
The trial first assessed several generative AI models before selecting one to ingest five submissions from a parliamentary inquiry into audit and consultancy firms. The most promising model, Meta’s open-source Llama2-70B, was prompted to summarise the submissions with a focus on mentions of ASIC, recommendations and references to more regulation, and to include page references and context.
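The report does not publish the exact prompt Amazon used, so the following is a minimal illustrative sketch of the kind of summarisation task described, assuming a local llama-cpp-python build of Llama2-70B; the model filename, prompt wording and sampling settings are all assumptions, not the trial’s actual configuration:

```python
# Illustrative only: the real prompt and settings used in the
# ASIC/Amazon trial are not public.
from llama_cpp import Llama

PROMPT_TEMPLATE = """Summarise the following inquiry submission.
Focus on: mentions of ASIC, any recommendations made, and any
references to more regulation. For each point, include the page
reference and surrounding context.

Submission:
{submission}

Summary:"""

# Hypothetical local GGUF build of Llama2-70B.
llm = Llama(model_path="llama-2-70b-chat.Q4_K_M.gguf", n_ctx=4096)

def summarise(submission_text: str) -> str:
    # Low temperature keeps the summary close to the source text.
    out = llm(
        PROMPT_TEMPLATE.format(submission=submission_text),
        max_tokens=1024,
        temperature=0.2,
    )
    return out["choices"][0]["text"].strip()
```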
Ten ASIC staff of varying levels of seniority were given the same task with similar prompts. A group of reviewers then blindly assessed the summaries produced by both humans and AI for coherence, length, references to ASIC, references to regulation, and identification of recommendations. The reviewers were not told that the exercise involved AI at all.
These reviewers overwhelmingly found that the human summaries beat their AI competitors on every criterion and on every submission, scoring 81% on an internal rubric compared with the machine’s 47%.
Human summaries ran up the score largely by outperforming the AI at identifying references to ASIC documents in long documents, which the report notes is a “notoriously hard task” for this kind of model. But humans still beat the technology across the board.
Reviewers told the report’s authors that AI summaries often missed emphasis, nuance and context; included incorrect information or missed relevant information; and sometimes focused on auxiliary points or introduced irrelevant information. Three of the five reviewers said they guessed that they were reviewing AI content.
The reviewers’ overall feedback was that AI summaries could be counterproductive, creating further work because of the need to fact-check them and refer back to the original submissions, which communicated the message better and more concisely.
The report notes some limitations and context for the study: the model used has since been superseded by more capable versions, which may summarise information better, and Amazon improved the model’s performance by refining its prompts and inputs, suggesting further gains are possible. The authors remain optimistic that machines may one day perform this task competently.
But until then, the trial showed that a human’s ability to parse and critically analyse information is unparalleled by AI, the report said.
“This finding also supports the view that GenAI should be positioned as a tool to augment and not replace human tasks,” the report concluded.
Greens Senator David Shoebridge, whose question to ASIC prompted the publication of the report, said it was “hardly surprising” that humans were better than AI at this task. He also said it raised questions about how the public might feel about AI being used to read their inquiry submissions.
“This of course doesn’t mean there is never a role for AI in assessing submissions, but if it has a role it must be transparent and supportive of human assessments and not stand-alone,” he said.
“It’s good to see government departments undertaking considered exercises like this for AI use, but it would be better if it was then proactively and routinely disclosed rather than needing to be requested in Senate committee hearings.”