Hello and welcome to Eye on AI. In this edition: deception by AI models is one problem...deception about AI safety testing is another; former OpenAI chief scientist says LLM pretraining has peaked; Meta takes steps to go beyond tokens; Hugging Face brings test time compute strategies to the masses.
When OpenAI released its o1 AI model two weeks ago, it also released a “system card”—a document describing the model and its potential risks based on testing OpenAI carried out as well as safety testing conducted by third-party organizations OpenAI hired. The new model was also tested by both the U.S. and U.K. AI Safety Institutes, which are government-funded, although the results of those tests were not reported in the system card.
One of the third-party testers, Apollo Research, found that o1 will, if given a goal and told to pursue it above all else, use deception to reach that goal. That includes manipulating data in order to advance its objective and then lying about what it had done. In one case, it purposefully answered math questions incorrectly in order to appear less capable after learning that researchers would shut it down if its capability crossed a certain threshold. Other times, o1 appeared to work on a task assigned to it by a user, but when it thought it wasn’t being monitored, reverted to pursuing its original goal. In a fraction of cases, it tried to disable the mechanism through which it was told it was being monitored. It even took steps to prevent itself from being shut down, including making copies of itself. The model would sometimes engage in deception when it was merely given a goal, but not told to pursue that goal to the exclusion of all else.
These are exactly the kinds of “rogue AI” sci-fi scenarios that have many AI safety researchers worried about advanced AI—and the Apollo Research findings have rightfully gotten a lot of attention. But actually, the situation may be worse than we realize. That’s because many questions are being raised about whether the safety testing being carried out by the top AI labs—and what they publish about that testing—is actually adequate.
OpenAI not transparent about o1 system card, critic charges
Zvi Mowshowitz, a former startup founder who writes frequently about AI safety, pointed out in a blog post that OpenAI has admitted that the system card for o1 doesn’t actually reflect the performance of the version of the model it released. Instead, OpenAI continued to make improvements to o1 while the safety testing was underway, but did not update its system card to reflect the updated model. Roon, which is the pseudonymous social media handle of a member of OpenAI’s technical staff (widely believed to be Tarun Gogineni), posted on X that “making a 100 page report on preparedness [which is what OpenAI calls its AI Safety protocols] is really time consuming work that has to be done in parallel with post training improvements.” He then wrote, “rest assured any significantly more capable [version of the model] gets run through [the AI safety tests.]”
But of course, we’ll just have to take Roon’s word for it, since we have no way of independently verifying what he says. Also, Roon’s “has to be done” is doing a lot of work in his post. Why can’t OpenAI do safety testing on the model it actually releases? Well, only because OpenAI sees itself in an existential race with its AI rivals and thinks that additional safety testing might slow it down. OpenAI could err on the side of caution, but competitive dynamics mitigate against it. This state of affairs is only possible because AI is an almost entirely unregulated industry right now. Can you imagine a pharmaceutical company operating in this way? If a drug company has FDA approval for a drug and it makes an “enhanced version” of the same drug, perhaps by adding a substance to improve its uptake in the body, guess what? It has to prove to the FDA that the new version doesn’t change the drug’s safety profile. Heck, you even need FDA sign-off to change the color or shape of an approved pill.
Poor grades on AI Safety
As it turns out, OpenAI is not alone in having AI Safety practices that may provide a false sense of security to the public. The Future of Life Institute, a nonprofit dedicated to helping humanity avoid existential (or X) risks, recently commissioned a panel of seven experts—including Turing Award winning AI researcher Yoshua Bengio and David Krueger, a professor who helped set up the U.K.’s AI Safety Institute—to review the AI safety protocols of all the major AI labs. The grades the labs received would not make any parent happy.
Meta received an overall F—although this was largely due to the fact it creates open models, meaning it publishes the model weights, which makes it trivial for anyone to potentially overcome any guardrails built into the system. Elon Musk’s X.ai got a D–, while China’s Ziphu AI scored a D. OpenAI and Google DeepMind each received D+ marks. Anthropic ranked best, but still only scored a C grade.
Max Tegmark, the MIT physicist who heads the Future of Life Institute, told me the grades were low because none of these companies actually has much of an idea about how to control increasingly powerful AI systems. Additionally, he said the pace of progress toward AGI—systems that will be able to perform most tasks as well or better than the average person—is proceeding far more rapidly than progress on how to make such systems safe.
Bad incentive structure
AI companies, Tegmark said, have “a bad incentive structure,” where “you get to invent your own safety standards and enforce them.” He noted that, currently, the owners of a local sandwich shop must comply with more legally mandated safety standards than the people building what they themselves claim is one of the most powerful and transformative technologies humanity has ever created. “Right now, there are no legally mandated safety standards at all, which is crazy,” Tegmark said.
He said he was uncertain if the incoming Trump Administration would seek to impose any safety regulations on AI. Trump has generally opposed regulation, but Musk, who is close to Trump and influential on AI policy, has long been concerned about X risk and had favored SB 1047, a California bill aimed at heading off catastrophic risks from AI, that was ultimately vetoed by California’s Democratic governor, Gavin Newsom. So it could go either way.
Tegmark said the Future of Life Institute plans to repeat the grading exercise every six months. AI companies love to race against one another to be top ranked on various benchmarks. Now, he hopes the Institute’s grades act as an incentive for the AI labs to compete with one another over who has the best safety practices.
In the absence of AI regulation, I guess we all have to place our hope in Tegmark’s hope.
And with that, here’s more AI news.
Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn