Hello and welcome to Eye on AI. In this edition… Watch OpenAI’s hands; Trump scraps Biden’s AI order; Whistleblower targets Amazon and Covariant; Titans vs. Transformers.
OpenAI may or may not be about to release something big and agentic.
According to a rather breathless Axios article on Sunday, an unidentified company is preparing “Ph.D.-level super-agents” that would be “a true replacement for human workers.” No names are named, but the article prominently notes that OpenAI CEO Sam Altman will give Trump administration officials a closed-door briefing at the end of the month.
It goes on to add: “Sources say this coming advancement is significant. Several OpenAI staff have been telling friends they are both jazzed and spooked by recent progress.” Those sources apparently come from “the U.S. government and leading AI companies.”
There’s more than a whiff of hype about all this. But Altman claims to be no fan of such things. Addressing the separate but perhaps connected issue of OpenAI’s efforts to achieve “artificial general intelligence” (definitions differ, but this usually means AI with human- or superhuman-level capabilities), the CEO tweeted yesterday that “Twitter hype is out of control again” and “we are not gonna deploy AGI next month, nor have we built it.”
If he’s so anti-hype, Altman might want to take himself aside for tweeting, less than three weeks ago: “I have always wanted to write a six-word story. Here it is: Near the singularity; unclear which side.” A story, sure, but it also came across as a strong hint. ("The singularity" is a term referring to the inflection point where AI surpasses human intelligence.)
In yesterday's tweet, Altman promised “We have some very cool stuff for you.” I’ve asked OpenAI whether it is the company that’s about to reveal “Ph.D.-level super-agents” and have received no response. But The Information reports that OpenAI will launch an agentic system called Operator, which can autonomously execute tasks on the user’s behalf, as soon as this month.
Whatever OpenAI does release, people should scrutinize it very closely, because the company has in recent days been caught up in a bit of a benchmarking scandal that raises questions about its performance claims.
The benchmark in question is FrontierMath, which was used in the demonstration of OpenAI’s flagship o3 model a month back. Curated by Epoch AI, FrontierMath contains only “new and unpublished” math problems, which is supposed to avoid the issue of a model being asked to solve problems that were included in its training dataset. Epoch AI says models such as OpenAI’s GPT-4 and Google’s Gemini only manage scores of less than 2%. In its demo, o3 scored a shade over 25%.
Problem is, it turns out that OpenAI funded the development of FrontierMath and apparently instructed Epoch AI not to tell anyone about this until the day of o3’s unveiling. After an Epoch AI contractor used a LessWrong post to complain that mathematicians contributing to the dataset had been kept in the dark about the link, Epoch associate director Tamay Besiroglu apologized, saying OpenAI’s contract had left the company unable to disclose the funding earlier.
“We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities,” Besiroglu wrote. “However, we have a verbal agreement that these materials will not be used in model training.”
OpenAI has not yet responded to a question about whether it nonetheless used its FrontierMath access when training o3—but its critics aren’t holding back. “The public presentation of o3 from a scientific perspective was manipulative and disgraceful,” the notable AGI skeptic Gary Marcus told my colleague Jeremy Kahn in Davos yesterday, adding that the presentation was “deliberately structured to make it look like they were closer to AGI than they actually are.”
“OpenAI should be more transparent about what the business arrangements were [with Epoch AI] and the extent to which they were given a competitive advantage and the extent to which they trained directly or indirectly on materials they had access to and the extent to which they used data augmentation techniques on information they had access to,” Marcus said. “If they are not transparent, we should not take them seriously.”
That’s something to bear in mind over the coming weeks. And with that, here’s more on what has been a very busy few days on the AI news front.
David Meyer
david.meyer@fortune.com
@dmeyer.eu on Bluesky