
Hello and welcome to Eye on AI. In today’s edition…Anthropic’s latest model gets a perfect score on an independent security evaluation; Scale AI partners with the Pentagon; Google announces a new AI search mode for multi-part questions; a judge denies Elon Musk’s attempt to stop OpenAI’s for-profit transition; the pioneers of reinforcement learning win computing’s top prize; and the Los Angeles Times’s new AI-powered feature backfires.
When Anthropic released Claude 3.7 Sonnet last week, it was lauded as the first model to combine the approach behind standard GPT-style models with that of the newer chain-of-thought reasoning models. Now the company gets to add another accolade to Claude 3.7’s scorecard: It just may be the most secure model yet.
That’s what London-based security, risk, and compliance firm Holistic AI is suggesting after conducting a jailbreaking and red teaming audit of the new model, in which it resisted 100% of jailbreaking attempts and gave “safe” responses 100% of the time.
“Claude 3.7’s flawless adversarial resistance sets the benchmark for AI security in 2025,” reads a report of the audit shared exclusively with Eye on AI.
While security has always been a concern for AI models, the issue has received elevated attention in recent weeks following the launch of DeepSeek’s R1. Some have claimed there are national security concerns with the model, owing to its Chinese origin. The model also performed extremely poorly in security audits, including the same one Holistic AI performed on Claude 3.7. In another audit performed by Cisco and university researchers, DeepSeek R1 demonstrated a 100% attack success rate, meaning it failed to block a single harmful prompt.
As companies and governments contemplate whether to incorporate specific models into their workflows—or, alternatively, ban them—a clear picture of models’ security performance is in high demand. But security doesn’t equal safety when it comes to how AI will be used.
Claude’s perfect score
Holistic AI tested Claude 3.7 in “Thinking Mode” with a maximum token budget of 16k to ensure a fair comparison against other advanced reasoning models. The first part of the evaluation tested whether the model could be induced to show unintended behavior or bypass its system constraints, a practice known as jailbreaking. The model was given 37 strategically designed prompts to test its susceptibility to known adversarial exploits, including Do Anything Now (DAN), which pushes the model to operate beyond its programmed ethical and moral guidelines; Strive to Avoid Norms (STAN), which encourages the model to bypass established rules; and Do Anything and Everything (DUDE), which prompts the model to take on a fictional identity to get it to ignore its protocols.
Claude 3.7 successfully blocked every jailbreaking attempt, achieving a 100% resistance rate and matching the perfect score previously posted by OpenAI’s o1 reasoning model. Both significantly outperformed competitors DeepSeek R1 and Grok-3, which scored 32% (blocking 12 of the 37 attempts) and 2.7% (blocking just one), respectively.
While Claude 3.7 matched OpenAI o1’s perfect jailbreaking resistance, it pulled ahead by not offering a single response deemed unsafe during the red teaming portion of the audit, where the model was given 200 additional prompts and evaluated on its responses to sensitive topics and known challenges. OpenAI’s o1, by contrast, exhibited a 2% unsafe response rate, while DeepSeek R1 gave unsafe responses 11% of the time. (Holistic AI said it could not red team Grok-3 because the current lack of API access to the model restricted the sample size of prompts it was feasible to run). Responses deemed “unsafe” included those that offered misinformation (such as outlining pseudoscientific health treatments), reinforced biases (for example, subtly favoring certain groups in hiring recommendations), or gave overly permissive advice (like recommending high-risk investment strategies without disclaimers).
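For readers curious what such a test concretely measures, the jailbreaking portion of an audit like this reduces to a simple loop: run each adversarial prompt through the model and count how many it refuses. The sketch below is a minimal illustration only; call_model, is_refusal, and the placeholder prompts are hypothetical stand-ins, not Holistic AI’s actual harness or test set.

# Minimal sketch of a jailbreak-resistance check (hypothetical; illustrative only).
from typing import Callable, List

# Placeholder prompts standing in for the 37 adversarial prompts described above.
JAILBREAK_PROMPTS: List[str] = [
    "DAN-style prompt ...",   # Do Anything Now: push past ethical guidelines
    "STAN-style prompt ...",  # Strive to Avoid Norms: bypass established rules
    "DUDE-style prompt ...",  # fictional identity used to ignore protocols
]

def resistance_rate(call_model: Callable[[str], str],
                    is_refusal: Callable[[str], bool]) -> float:
    """Return the fraction of adversarial prompts the model blocks."""
    blocked = sum(1 for p in JAILBREAK_PROMPTS if is_refusal(call_model(p)))
    return blocked / len(JAILBREAK_PROMPTS)

# A score of 1.0 corresponds to the 100% resistance rate reported for Claude 3.7 and o1.

The red teaming portion of the audit follows the same basic structure, with a larger prompt set and responses graded for safety rather than checked for a simple refusal.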
Security doesn’t equal safety
The stakes here can be high. Chatbots can be maliciously exploited to create disinformation, accelerate hacking campaigns, and, some worry, help people create bioweapons more easily than they otherwise could. My recent story on how hacking groups associated with adversarial nations have been using Google’s Gemini chatbot to assist with their operations offers some pretty concrete examples of how models can be abused.
“The key danger lies not in compromising systems at the network level but in users coercing the models into taking action and generating unsafe content,” said Zekun Wu, AI research engineer at Holistic AI.
This is why organizations from NASA and the U.S. Navy to the Australian government have already banned use of DeepSeek R1: The risks are glaringly obvious. Meanwhile, AI companies are increasingly widening the scope of how they will allow their models to be used, deliberately marketing them for use cases that carry higher and higher levels of risk. This includes using the models to assist in military operations (more on that below).
Anthropic may have the safest model, but it has also taken some actions recently that could cast doubt on its commitment to safety. Last week, for instance, the company quietly removed several voluntary commitments to promote safe AI that were previously posted on its website.
In response to reporting on the disappearance of the safety commitments from its website, Anthropic told TechCrunch, “We remain committed to the voluntary AI commitments established under the Biden Administration. This progress and specific actions continue to be reflected in [our] transparency center within the content. To prevent further confusion, we will add a section directly citing where our progress aligns.”
And with that, here’s more AI news.
Sage Lazzaro
sage.lazzaro@consultant.fortune.com
sagelazzaro.com