Mathematicians have stumped the most advanced generative artificial intelligence (AI) models with a series of mind-bending new math problems.
These problems typically take doctorate-level mathematicians hours to days to solve, according to the research institute Epoch AI. But in the new tests, the most advanced AI models on the market got correct answers on fewer than 2% of these problems.
In the past decade, a number of AI tests have been developed to determine whether the answers these models return are actually correct. In many cases, AI models now breeze through these benchmarks.
For example, on the commonly used Massive Multitask Language Understanding (MMLU) benchmark, today's AI models answer 98% of math problems correctly.
Most of these benchmarks are geared toward testing AI's ability to do high-school and college-level math, Elliot Glazer, a mathematician at Epoch AI, and colleagues wrote in a new paper posted on the preprint database arXiv. (The paper has not yet been peer-reviewed or published in a scientific journal.)
The new set of benchmarks, called FrontierMath, aims for a higher level of reasoning. Epoch AI developed the questions with the help of mathematics professors, including some winners of the Fields Medal, perhaps the most prestigious prize in math. The problems cover a wide range of subfields, from number theory to algebraic geometry, and are available on Epoch AI's website.
"These are extremely challenging," 2006 Fields Medal winner Terence Tao, a mathematician at UCLA, wrote in a review of the problems for Epoch AI. "I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages."
The problems were also unique, a step taken to ensure that none of them was already in the AI models' training data. When complex reasoning problems are included in the training data, the AI may appear to solve the problems, but in reality, it already has a "cheat sheet," since it has been trained on the answers.
The researchers tested six state-of-the-art AI models: Google's Gemini 1.5 Pro (002), Anthropic's Claude 3.5 Sonnet, OpenAI's o1-preview, o1-mini and GPT-4o, and xAI's Grok-2 Beta. Gemini and Claude each solved 2% of the problems, just slightly better than the 1% solved by each of o1-preview, o1-mini and GPT-4o. Grok-2 Beta failed to get any problems right.
However, these rankings are misleading because the low success rate means that a single right answer can have an outsize impact on each model's overall score, the researchers cautioned.
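To see why, consider a rough sketch of the arithmetic. The benchmark size used here (200 problems) is a hypothetical figure chosen for illustration, not a number taken from the paper, and the function names are likewise invented for this example.

```python
def score_percent(correct: int, num_problems: int) -> float:
    """Share of problems answered correctly, as a percentage."""
    if num_problems <= 0:
        raise ValueError("num_problems must be positive")
    return 100.0 * correct / num_problems


def points_per_answer(num_problems: int) -> float:
    """Percentage points that a single correct answer adds to a score."""
    if num_problems <= 0:
        raise ValueError("num_problems must be positive")
    return 100.0 / num_problems


# Hypothetical benchmark size, assumed only for illustration.
NUM_PROBLEMS = 200

model_a = score_percent(4, NUM_PROBLEMS)  # 4 correct answers
model_b = score_percent(2, NUM_PROBLEMS)  # 2 correct answers

print(f"Model A: {model_a:.1f}%, Model B: {model_b:.1f}%")
print(f"One answer is worth {points_per_answer(NUM_PROBLEMS):.1f} points")
```

Under that assumption, a model scoring 2% and a model scoring 1% are separated by just two correct answers, so a single lucky guess would close half the gap.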
"[E]ven when a model obtained the correct answer, this does not mean that its reasoning was correct," the paper authors wrote. "For instance, on one of these problems running a few simple simulations was sufficient to make accurate guesses without any deeper mathematical understanding. However, models' low overall accuracy shows that such guessing strategies do not work on the overwhelming majority of FrontierMath problems."
The findings show that right now, AI models don't possess research-level math reasoning, Epoch AI's collaborators concluded. However, as AI models advance, these benchmark tests will provide a way to find out if their reasoning abilities are deepening.
"By regularly evaluating state-of-the-art models and collaborating with the AI research community," the team wrote in the statement, "we aim to deepen our understanding of AI’s capabilities and limitations."