Fortune
Jeremy Kahn

OpenAI claims a new method for training chatbots is a breakthrough. Actually, it's a setback.

OpenAI CEO Sam Altman (Credit: Jack Guez—AFP via Getty Images)

Instruction-following large language models, such as OpenAI’s ChatGPT, and rival systems such as Google’s Bard and Anthropic’s Claude, have the potential to revolutionize business. But a lot of companies are struggling to figure out how to use them. That’s primarily because they are unreliable and prone to providing authoritative-sounding but inaccurate information. It's also because the content these A.I. models generate can pose risks. They can output toxic language or encourage users to engage in unsafe or illegal behavior. They can reveal data that companies wish to safeguard. Dozens of companies are racing to figure out how to solve this problem—and there’s a pot of gold for whoever gets there first.

Last week, OpenAI published a research paper and an accompanying blog post championing what it said was a potentially major step forward towards that goal, as well as towards solving the larger “alignment problem.” The “alignment problem” refers to how to imbue powerful A.I. systems with an understanding of human concepts and values. Researchers who work in the field known as “A.I. Safety” see it as critical to ensuring that future A.I. software won’t pose an extinction-level threat to humanity. But, as I’ll explain, I think the solution OpenAI proposes actually demonstrates how limited today’s large language models are. Unless we come up with a fundamentally different architecture for generative A.I., it is likely that the tension between “alignment” and “performance” will mean the technology never lives up to its full potential. In fact, one could argue that training LLMs in the way OpenAI suggests in its latest research is a step backwards for the field.

To explain why, let’s walk through what OpenAI’s latest research showed. First, you need to understand that one way researchers have tried to tame the wild outputs of large language models is through a process called reinforcement learning from human feedback (or RLHF for short). This means that humans rate the answers an LLM produces, usually with just a simple thumbs up or thumbs down (although some people have experimented with less binary feedback systems), and the LLM is then fine-tuned to produce answers that are more likely to be rated thumbs up. Another way to get LLMs to produce better quality answers, especially for tasks such as logic questions or mathematics, is to ask the LLM to “reason step by step” or “think step by step” instead of just producing a final answer. Exactly why this so-called “chain of thought” prompting works isn’t fully understood, but it does seem to consistently produce better results.
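
To make those two ideas concrete, here is a minimal, purely illustrative Python sketch of my own, not OpenAI’s actual pipeline: a chain-of-thought prompt is just the question with a “think step by step” instruction appended, and RLHF feedback starts life as thumbs-up/thumbs-down ratings that are turned into a reward signal later used to fine-tune the model. The function names and the model output below are hypothetical.

```python
# Illustrative sketch only (not OpenAI's code): chain-of-thought prompting plus
# the kind of thumbs-up/thumbs-down ratings that RLHF aggregates into rewards.

def chain_of_thought_prompt(question: str) -> str:
    # Appending a "think step by step" instruction nudges the model to emit
    # intermediate reasoning instead of jumping straight to a final answer.
    return f"{question}\nLet's think step by step, then give the final answer."

def record_feedback(prompt: str, completion: str, thumbs_up: bool) -> dict:
    # In RLHF, many ratings like this train a reward model, and the LLM is then
    # fine-tuned to produce completions that the reward model scores highly.
    return {"prompt": prompt, "completion": completion,
            "reward": 1.0 if thumbs_up else 0.0}

prompt = chain_of_thought_prompt("A train covers 60 miles in 1.5 hours. What is its average speed?")
completion = "60 / 1.5 = 40, so the average speed is 40 mph."  # imagined model answer
dataset = [record_feedback(prompt, completion, thumbs_up=True)]
print(dataset[0]["reward"])  # 1.0 -> this completion would be reinforced
```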

What OpenAI did in its latest research was to see what happened when an LLM was told to use chain of thought reasoning and was also trained using RLHF on each of the logical steps in the chain (instead of on the final answer). OpenAI called this “process supervision” as opposed to the “outcome supervision” it has used before. Well, it turns out, perhaps not surprisingly, that giving feedback on each step produces much better results. You can think of this as similar to how your junior high math teacher always admonished you to “show your work” on exams. That way she could see if you understood the reasoning needed to solve the question, and could give you partial credit even if you made a simple arithmetic error somewhere in the process.
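
As a toy illustration of the difference (again my own sketch, not OpenAI’s implementation), compare rewarding only the final answer with rewarding each step of a worked solution in which one arithmetic slip sinks the outcome even though most of the reasoning was sound. The step labels stand in for the human judgments the paper describes.

```python
# Toy contrast between outcome supervision (one reward for the final answer) and
# process supervision (a reward for every reasoning step). Hypothetical example.

reasoning_steps = [
    "Step 1: 12 * 3 = 36",   # correct
    "Step 2: 36 + 5 = 42",   # arithmetic slip: should be 41
    "Final answer: 42",      # wrong, because of Step 2
]

def outcome_supervision(final_answer_correct: bool) -> list[float]:
    # The whole chain gets a single reward based only on the end result.
    return [1.0 if final_answer_correct else 0.0]

def process_supervision(step_is_correct: list[bool]) -> list[float]:
    # Each step is rated on its own -- the "show your work" partial credit.
    return [1.0 if ok else 0.0 for ok in step_is_correct]

print(outcome_supervision(False))                  # [0.0]
print(process_supervision([True, False, False]))   # [1.0, 0.0, 0.0]
```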

There are just a couple of problems. One, as some other researchers have pointed out, it isn’t clear if this “process supervision” will help with the whole range of hallucinations LLMs exhibit, especially those involving nonexistent quotations and inaccurate citations, or if it only addresses the subset of inaccuracies that involve logic. It’s becoming increasingly clear that aligning LLMs to avoid many of the undesirable outcomes businesses are afraid of may require a much more fundamental rethink of how these models are built and trained.

In fact, a group of Israeli computer scientists from Hebrew University and AI21 Labs recently explored whether RLHF was a robust alignment method and found serious problems. In a paper published this month, the researchers said that they had proved that for any behavior an A.I. model could exhibit, no matter how unlikely, there existed a prompt that could elicit that behavior, with less likely behaviors simply requiring longer prompts. “This implies that any alignment process that attenuates undesired behavior but does not remove it altogether, is not safe against adversarial prompting attacks,” the researchers wrote. What’s worse, they found that techniques such as RLHF actually made it easier, not harder, to nudge a model into exhibiting undesired behavior.

There’s also a much bigger problem. Even if this technique is successful, it ultimately limits, not enhances, what A.I. can do: In fact, it risks throwing away the genius of Move 37. What do I mean? In 2016, AlphaGo, an A.I. system created by what is now Google DeepMind, achieved a major milestone in computer science when it beat the world’s top human player at the ancient strategy board game Go in a best-of-five demonstration match. In the second game of that contest, on the game’s 37th move, AlphaGo placed a stone in a spot so unusual and, to human Go experts, so counterintuitive that almost everyone assumed it was an error. AlphaGo itself estimated that there was a less than one in ten thousand chance that a human would ever play that move. But AlphaGo also predicted that the move would put it in an excellent position to win the game, which it did. Move 37 wasn’t an error. It was a stroke of genius.

Later, when experts analyzed AlphaGo’s play over hundreds of games, they came to see that it had discovered a way of playing that upended 1,000 years of human expertise and intuition about the best Go strategies. Similarly, another system created by DeepMind, AlphaZero, which could master a variety of different strategy games, played chess in a style that seemed, to human grandmasters, so bizarre yet so effective that some branded it “alien chess.” In general, it was willing to sacrifice supposedly high-value pieces in order to gain board position in a way that made human players queasy. Like AlphaGo, AlphaZero was trained using reinforcement learning, playing millions of games against itself, where the only reward it received was whether it won or lost.

In other words, AlphaGo and AlphaZero received no feedback from human experts on whether any interim step they took was positive or negative. As a result, the A.I. software was able to explore all kinds of strategies unconstrained by the limitations of existing human understanding of the game. If AlphaGo had received process supervision from human feedback, as OpenAI is positing for LLMs, a human expert almost certainly would have given Move 37 a thumbs down. After all, human Go masters judged Move 37 illogical. It turned out to be brilliant. And that is the problem with OpenAI’s suggested approach. It is ultimately a kluge—a crude workaround designed to paper over a problem that is fundamental to the design of LLMs.

Today’s generative A.I. systems are very good at pastiche. They regurgitate and remix human knowledge. But if what we really want is A.I. systems that can help us solve the toughest problems we face—from climate change to disease—then what we need is not simply a masala of old ideas, but fundamentally new ones. We want A.I. that can ultimately advance novel hypotheses, make scientific breakthroughs, and invent new tactics and methods. Process supervision with human feedback is likely to be detrimental to achieving that goal. We will wind up with A.I. systems that are well-aligned, but incapable of genius.

With that, here’s the rest of this week’s news in A.I.


But, before you read on: Do you want to hear from some of the most important players shaping the generative A.I. revolution and learn how companies are using the technology to reinvent their businesses? Of course you do! So come to Fortune’s Brainstorm Tech 2023 conference, July 10-12 in Park City, Utah. I’ll be interviewing Anthropic CEO Dario Amodei on building A.I. we can trust and Microsoft corporate vice president Jordi Ribas on how A.I. is transforming Bing and search. We’ll also hear from Antonio Neri, CEO of Hewlett Packard Enterprise, on how the company is unlocking A.I.’s promise; Arati Prabhakar, director of the White House’s Office of Science and Technology Policy, on the Biden Administration’s latest thoughts about how the U.S. can realize A.I.’s potential while enacting the regulation needed to guard against its significant risks; Meredith Whittaker, president of the Signal Foundation, on safeguarding privacy in the age of A.I.; and many, many more, including some of the top venture capital investors backing the generative A.I. boom. All that, plus fly fishing, mountain biking, and hiking. I’d love to have Eye on A.I. readers join us! You can apply to attend here.


Jeremy Kahn
@jeremyakahn
jeremy.kahn@fortune.com
