Generative artificial intelligence (AI) systems may be able to produce some eye-opening results, but new research shows they don't form a coherent understanding of the world and its rules.
In a new study published to the arXiv preprint database, scientists from MIT, Harvard and Cornell found that large language models (LLMs) like GPT-4 or Anthropic's Claude 3 Opus fail to build underlying models that accurately represent the real world.
When tasked with providing turn-by-turn driving directions in New York City, for example, LLMs delivered them with near-100% accuracy. But when the scientists extracted the underlying maps the models had formed, they were full of nonexistent streets and routes.
The researchers found that when unexpected changes were added to the task (such as detours and closed streets), the accuracy of the directions the LLMs gave plummeted; in some cases it resulted in total failure. This raises concerns that AI systems deployed in real-world settings, say in a driverless car, could malfunction when presented with dynamic environments or tasks.
"One hope is that, because LLMs can accomplish all these amazing things in language, maybe we could use these same tools in other parts of science, as well. But the question of whether LLMs are learning coherent world models is very important if we want to use these techniques to make new discoveries," said senior author Ashesh Rambachan, assistant professor of economics and a principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS), in a statement.
Tricky transformers
The crux of generative AI is the ability of LLMs to learn from vast amounts of data in parallel. To do this, they rely on transformer models, the underlying neural networks that process the data and enable the self-learning aspect of LLMs. This process creates a so-called "world model," which a trained LLM can then use to infer answers and produce outputs in response to queries and tasks.
One theoretical use of such a world model would be taking data from taxi trips across a city to generate a map without needing to painstakingly plot every route, as current navigation tools require. But if that map isn't accurate, deviations from a route would cause AI-based navigation to underperform or fail.
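To make that idea concrete, here is a minimal, hypothetical sketch, not the study's method, of how a street map could in principle be pieced together from observed trip sequences. The intersection names and trips below are invented for illustration:

```python
from collections import defaultdict

def build_map(trips):
    """Each trip is an ordered list of intersections visited.
    Returns the directed street connections implied by the trips."""
    graph = defaultdict(set)
    for trip in trips:
        for here, there in zip(trip, trip[1:]):
            graph[here].add(there)  # a street appears to run here -> there
    return graph

# Invented trips: a model that only memorizes routes can still end up
# implying streets that don't exist in the real city.
trips = [
    ["5th_and_42nd", "5th_and_43rd", "6th_and_43rd"],
    ["5th_and_42nd", "6th_and_42nd", "6th_and_43rd"],
]
print(dict(build_map(trips)))
```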
To assess how accurately and coherently transformer LLMs grasp real-world rules and environments, the researchers tested them using a class of problems called deterministic finite automata (DFAs). These are problems that can be described as a sequence of states, such as the positions in a board game or the intersections along a route to a destination. In this case, the researchers used DFAs drawn from the board game Othello and from navigation through the streets of New York.
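As a rough illustration of what a DFA looks like (the intersections and turns below are invented, not taken from the study), each state has exactly one successor for any given input:

```python
# A toy DFA: states are intersections, inputs are turns, and each
# (state, turn) pair leads to exactly one next state.
TRANSITIONS = {
    ("A", "left"): "B",
    ("A", "right"): "C",
    ("B", "right"): "D",
    ("C", "left"): "D",
}

def run_dfa(start, moves):
    """Follow a sequence of moves; return the final state, or None if a
    move is invalid from the current state (e.g., a nonexistent street)."""
    state = start
    for move in moves:
        state = TRANSITIONS.get((state, move))
        if state is None:
            return None
    return state

print(run_dfa("A", ["left", "right"]))  # -> "D"
print(run_dfa("A", ["left", "left"]))   # -> None (no such turn)
```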
To test the transformers with DFAs, the researchers looked at two metrics. The first, "sequence distinction," says a transformer has formed a coherent world model if it sees two different states (two different Othello boards, or a map of a city with road closures and one without) and recognizes how they differ. The second, "sequence compression," says a transformer with a coherent world model should understand that two identical states (say, two Othello boards that are exactly the same) have the same sequence of possible next steps.
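Here is a much-simplified sketch of the idea behind those two checks; the paper's actual metrics are more involved. A tiny invented DFA serves as ground truth, and any function that proposes valid next moves for a sequence stands in for the "model":

```python
# The same style of toy DFA as above, redefined so this snippet runs on its own.
TRANSITIONS = {("A", "left"): "B", ("A", "right"): "C",
               ("B", "right"): "D", ("C", "left"): "D"}

def final_state(moves, start="A"):
    state = start
    for m in moves:
        state = TRANSITIONS.get((state, m))
    return state

def true_next_moves(seq):
    """Ground truth: the moves the DFA actually allows after a sequence."""
    return {m for (s, m) in TRANSITIONS if s == final_state(seq)}

def compresses(model, seq_a, seq_b):
    """Compression: sequences reaching the SAME state should get the same
    proposed next moves from the model."""
    assert final_state(seq_a) == final_state(seq_b)
    return model(seq_a) == model(seq_b)

def distinguishes(model, seq_a, seq_b):
    """Distinction: sequences reaching DIFFERENT states should not be
    treated as interchangeable by the model."""
    assert final_state(seq_a) != final_state(seq_b)
    return model(seq_a) != model(seq_b)

# A model that truly tracks the underlying state passes both checks:
print(compresses(true_next_moves, ["left", "right"], ["right", "left"]))  # True
print(distinguishes(true_next_moves, ["left"], ["right"]))                # True
```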
Relying on LLMs is risky business
Two common classes of LLMs were tested on these metrics. One was trained on data generated from randomly produced sequences, while the other was trained on data generated by following strategic processes.
Transformers trained on random data formed a more accurate world model, the scientists found, possibly because the LLM saw a wider variety of possible steps. Lead author Keyon Vafa, a researcher at Harvard, explained in a statement: "In Othello, if you see two random computers playing rather than championship players, in theory you’d see the full set of possible moves, even the bad moves championship players wouldn’t make." By seeing more of the possible moves, even bad ones, the LLMs were theoretically better prepared to adapt to random changes.
However, despite generating valid Othello moves and accurate directions, only one transformer formed a coherent world model for Othello, and neither type produced an accurate map of New York. When the researchers introduced things like detours, all of the navigation models the LLMs had formed broke down.
"I was surprised by how quickly the performance deteriorated as soon as we added a detour. If we close just 1 percent of the possible streets, accuracy immediately plummets from nearly 100 percent to just 67 percent," added Vafa.
This shows that different approaches are needed if LLMs are to produce accurate world models, the researchers said. What those approaches might be isn't clear, but the result highlights the fragility of transformer-based LLMs when faced with dynamic environments.
"Often, we see these models do impressive things and think they must have understood something about the world," concluded Rambachan. "I hope we can convince people that this is a question to think very carefully about, and we don’t have to rely on our own intuitions to answer it."