LiveScience
Stephanie Pappas

Mathematicians devised novel problems to challenge advanced AIs' reasoning skills — and they failed almost every test

[Image: Equations shown in a digital format.]

Mathematicians have stumped the most advanced generative artificial intelligence (AI) models with a series of mind-bending new math problems.

These problems typically take doctorate-level mathematicians hours or even days to solve, according to the research institute Epoch AI. Yet in the new tests, the most advanced AI models on the market answered fewer than 2% of the problems correctly.

In the past decade, a number of AI tests have been developed to determine whether the answers these models return are actually correct. In many cases, AI models now breeze through these benchmarks.

For example, in the commonly used Measuring Massive Multitask Language Understanding (MMLU) benchmark test, today's AI models answer 98% of math problems correctly.

Most of these benchmarks are geared toward testing AI's ability to do high-school and college-level math, Elliot Glazer, a mathematician at Epoch AI, and colleagues wrote in a new paper posted on the preprint database arXiv. (The paper has not yet been peer-reviewed or published in a scientific journal.)


The new set of benchmarks, called FrontierMath, aims for a higher level of reasoning. Epoch AI developed the questions with the help of mathematics professors, including some winners of the Fields Medal, perhaps the most prestigious prize in math. The problems cover a wide range of subfields, from number theory to algebraic geometry, and are available on Epoch AI's website.

"These are extremely challenging," 2006 Fields Medal winner Terence Tao, a mathematician at UCLA, wrote in a review of the problems for Epoch AI. "I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages."

The problems were also unique — a step taken to ensure that none of the problems were already in the AI models' training data. When complex reasoning problems are included in the training data, the AI may appear to solve the problems, but in reality, it already has a "cheat sheet," since it has been trained on the answers.

The researchers tested six state-of-the-art AI models: Google's Gemini 1.5 Pro (002); Anthropic's Claude 3.5 Sonnet; OpenAI's o1-preview, o1-mini and GPT-4o; and xAI's Grok-2 Beta. Gemini and Claude each solved 2% of the problems, slightly better than the 1% scored by o1-preview, o1-mini and GPT-4o. Grok-2 Beta failed to get any problems right.

However, these rankings are misleading because the low success rate means that a single right answer can have an outsize impact on each model's overall score, the researchers cautioned.

"[E]ven when a model obtained the correct answer, this does not mean that its reasoning was correct," the paper authors wrote. "For instance, on one of these problems running a few simple simulations was sufficient to make accurate guesses without any deeper mathematical understanding. However, models' low overall accuracy shows that such guessing strategies do not work on the overwhelming majority of FrontierMath problems."

The findings show that right now, AI models don't possess research-level math reasoning, Epoch AI's collaborators concluded. However, as AI models advance, these benchmark tests will provide a way to find out if their reasoning abilities are deepening.

"By regularly evaluating state-of-the-art models and collaborating with the AI research community," the team wrote in a statement, "we aim to deepen our understanding of AI's capabilities and limitations."
