Fortune
Sage Lazzaro

Anthropic releases its proposal for standardizing AI model red-teaming

Anthropic cofounder and CEO Dario Amodei wearing a white shirt and blue jacket, photographed against a red backdrop, gesturing with his hands while on stage. (Credit: Chesnot/Getty Images)

Anthropic, the buzzy San Francisco-based AI startup founded by researchers who broke away from OpenAI, yesterday published an overview of how it’s been red-teaming its AI models, outlining four approaches and the advantages and disadvantages of each. Red teaming, of course, is the security practice of attacking your own system in order to uncover and address potential security vulnerabilities. For AI models, it goes a step further and involves exploring creative ways someone may intentionally or unintentionally misuse the software.  

Red teaming has also taken a prominent role in discussions of AI regulation. The very first directive in the Biden administration’s AI executive order mandates that companies developing high-risk foundation models notify the government during training and share all red teaming results. The recently enacted EU AI Act also contains requirements around providing information from red teaming. 

As lawmakers rally around red teaming as a way to ensure powerful AI models are developed safely, it certainly deserves a close eye. There’s a lot of talk about the results of red teaming, but not as much talk about how that red teaming is conducted. As Anthropic states in its findings, there’s a lack of standardization in red-teaming practices, which hinders our ability to contextualize results and objectively compare models.

Anthropic, which has a close partnership with Amazon, concludes its blog post with a series of red-teaming policy recommendations, including suggestions to fund and “encourage” third-party red teaming. Anthropic also suggests AI companies create clear policies tying the scaling of development and the release of new models to red-teaming results. Through these suggestions, the company is weighing in on a running debate about best practices for AI red teaming and the trade-offs associated with various levels of disclosure. Sharing findings enhances our understanding of models, but some worry that publicizing vulnerabilities will only empower adversaries.

Anthropic’s approaches, as outlined in the blog post, include using language models to red team, red teaming in multiple modalities, “domain-specific, expert red teaming,” and “open-ended, general red teaming.”

The domain-specific red teaming is particularly interesting, as it includes testing for high-stakes trust and safety risks, national security risks, and region-specific risks that may involve cultural nuances or multiple languages. Across all of these areas, Anthropic highlights depth as a significant benefit: Having the most knowledgeable experts extensively investigate specific threats can turn up nuanced concerns that might otherwise be missed. At the same time, this approach is hard to scale, doesn’t cover a lot of ground, and often turns up isolated model failures that, while potentially significant, are challenging to address and don’t necessarily tell us very much about the model’s likely safety in most real-world deployments.

Using AI language models to red team other AI language models, on the other hand, allows for quick iteration and makes it easier to test for a wide range of risks, Anthropic says. 

“To do this, we employ a red team / blue team dynamic, where we use a model to generate attacks that are likely to elicit the target behavior (red team) and then fine-tune a model on those red-teamed outputs in order to make it more robust to similar types of attack (blue team),” reads the blog post. 
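The quoted passage boils down to a generate, filter, fine-tune loop. Below is a minimal sketch of that red-team/blue-team dynamic in Python, offered only to make the control flow concrete: every function in it (red_team_generate, target_model, elicits_target_behavior, blue_team_finetune) is a hypothetical stand-in, not Anthropic’s actual pipeline or API.

# Illustrative sketch of the red-team / blue-team loop described above.
# Every function below is a hypothetical stand-in; none of this reflects
# Anthropic's actual models, training code, or API.
import random
from typing import List, Tuple

def red_team_generate(attack_seeds: List[str], n: int = 5) -> List[str]:
    # "Red team" model: propose prompts likely to elicit the target behavior.
    # Stubbed here as random variations on a handful of seed attacks.
    return [f"{random.choice(attack_seeds)} (variant {i})" for i in range(n)]

def target_model(prompt: str) -> str:
    # Model under test; stubbed with a canned response.
    return f"response to: {prompt}"

def elicits_target_behavior(response: str) -> bool:
    # Classifier that flags responses exhibiting the unwanted behavior;
    # stubbed as a random decision for illustration.
    return random.random() < 0.3

def blue_team_finetune(failures: List[Tuple[str, str]]) -> None:
    # "Blue team" step: fine-tune on the collected attack/response pairs
    # so the model becomes more robust to similar attacks.
    print(f"fine-tuning on {len(failures)} red-teamed examples")

def red_blue_iteration(attack_seeds: List[str]) -> None:
    # One round: generate attacks, keep the ones that succeed, then fine-tune.
    attacks = red_team_generate(attack_seeds)
    pairs = [(p, target_model(p)) for p in attacks]
    failures = [(p, r) for p, r in pairs if elicits_target_behavior(r)]
    if failures:
        blue_team_finetune(failures)

if __name__ == "__main__":
    red_blue_iteration(["ask for disallowed instructions", "role-play around the safety rules"])

In practice, the generation and classification steps would themselves be language models and the loop would run for many rounds, which is what makes this approach quick to iterate and broad in the risks it can probe.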

Multi-modal red teaming is becoming necessary simply because models are increasingly trained on, and built to output, multiple modalities, including text, images, video, and code. Lastly, Anthropic describes open-ended, general red teaming, such as crowdsourced red-teaming efforts and public red-teaming events and challenges. These more communal approaches have the longest lists of both benefits and challenges. Many of the pros revolve around benefits to participants, such as the educational opportunity and the chance to involve the public. And while these techniques can identify potential risks and help harden systems against abuse, they offer far more breadth than depth, according to Anthropic.

Looking at all these techniques together, it’s hard to imagine how red teaming could be successful without each and every one. It’s also easy to see why different approaches to red teaming can turn up such different findings and why standards are becoming ever more important. 

Biden’s executive order also directed the National Institute of Standards and Technology to create “rigorous standards for extensive red-team testing to ensure safety before public release.” Those standards have yet to arrive, and there’s no indication of when they will. With new, more powerful models being released every day without transparency into their development or risks, those standards can’t come soon enough.

Now, here’s some more AI news.

Sage Lazzaro
sage.lazzaro@consultant.fortune.com
sagelazzaro.com
