Fortune
David Meyer

Reliable ‘reasoning’ AI agents may be just around the corner thanks to DeepSeek’s innovations, say researchers

(Credit: Jaap Arriens—NurPhoto/Getty Images)

Innovations made by China’s DeepSeek could soon lead to the creation of AI agents that have strong reasoning skills but are also small enough to run directly on people’s computers and mobile devices, according to a researcher at the open-source AI organization Hugging Face.

Starting with OpenAI’s o1 last September, the past several months have seen the emergence of AI models that can “reason” in a sense, by doing step-by-step thinking. DeepSeek astonished the sector two weeks ago by releasing a reasoning model called R1 that could match o1’s performance in many tasks, despite the fact that it cost a fraction as much to train.

DeepSeek achieved this through a combination of clever algorithmic advances and optimization of the hardware used in the training. DeepSeek also showed that it was fairly easy to transfer reasoning capabilities from a big model like R1 into a much smaller model like Meta’s Llama-8B, in a process called distillation. What’s more, it open-sourced much of its work—big models and smaller, distilled versions—allowing others to freely build on its achievements.

Within days, the team at Hugging Face kick-started a new community project called Open-R1 that aims to replicate what DeepSeek did. Hugging Face researcher Lewis Tunstall told Fortune on Tuesday that the work was “going quite fast” and would shortly have a big impact in the red-hot field of “agentic” AI, which essentially involves AI systems that can autonomously perform tasks on the user’s behalf.

“One of the big bottlenecks [with AI agents] has always been reliability—how do you make sure these agents don’t hallucinate the wrong decision and, for example, delete all your emails?” Tunstall said. “One big advantage of these reasoning models is they seem to be far more capable of detecting their own errors and therefore potentially being more reliable. So what I expect will happen in the coming months is that people will use these methods that were pioneered by R1 to try and create reliable agents which then can run on many different devices.”

Tunstall said some of these agents would be free for people to download and use, as is the way with open-source technology, though proprietary models based on DeepSeek’s advances were also likely. “I expect many AI agent companies are looking for ways to distill the reasoning traces from DeepSeek-R1 into smaller models that can power their products,” he said, referring to the record of logical steps, as well as the model’s own internal commentary on the strategies it is trying, that the model outputs.
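As a rough illustration of what such a trace looks like in practice: DeepSeek's released R1 models wrap their step-by-step thinking in `<think>` tags ahead of the final answer. The Python sketch below separates the two, which is the kind of preprocessing a distillation dataset might need. The function name and the sample text are made up for this example and do not come from DeepSeek or Hugging Face.

```python
import re

def split_trace(output: str):
    """Split an R1-style output into (reasoning trace, final answer).

    Assumes the reasoning sits inside one <think>...</think> block
    that comes before the answer. Hypothetical helper for illustration.
    """
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if m is None:
        # No trace found: treat the whole output as the answer.
        return "", output.strip()
    return m.group(1).strip(), output[m.end():].strip()

# Invented sample output, not a real model response.
sample = "<think>2 + 2 = 4. Let me double-check: yes.</think>The answer is 4."
trace, answer = split_trace(sample)
```

A company distilling a smaller model would typically keep both parts, so the small model learns to produce the reasoning as well as the answer.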

Opening the rest

Much like Meta with its open-source Llama models, DeepSeek’s open-sourcing of R1 and its underlying base model, V3, came with limits. It released the models themselves, including several distilled versions of R1, along with the “weights” that allow developers to customize them. But although it outlined the algorithmic “recipe” it used to train its reasoning models, it released neither the training code nor the datasets used in that training.

The Open-R1 project is essentially about filling in those blanks, primarily so anyone can replicate the “post-training” method that DeepSeek used to refine R1 out of V3. Ultimately, the project may also make it possible to replicate the “pre-training” method that DeepSeek used to make V3, though as Hugging Face already hosts nearly a million pre-trained models in its repository, it’s focusing on the post-training aspect first.  

“We will have in a few weeks a first end-to-end demonstration of the post-training method from all the datasets all the way to the final models,” said Tunstall, adding that the next big question would be how easy it is to scale that recipe to the larger V3 model.

The project has already figured out how to implement DeepSeek’s novel reinforcement-learning algorithm, which the Chinese firm named Group Relative Policy Optimization (GRPO).
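To make the "group relative" part of the name concrete: as described in DeepSeek's published papers, GRPO samples several answers to the same prompt, scores each one, and rates every answer against the average of its own group rather than against a separately trained value model. The Python sketch below shows only that normalization step. The function name, the example rewards, and the use of the population standard deviation are assumptions made for illustration, not Open-R1's code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer relative to its own group.

    rewards: one reward per answer sampled for the same prompt.
    eps avoids division by zero when every answer gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std dev (an assumption here)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt: two judged correct (1.0), two wrong (0.0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Answers that beat their group's average get a positive advantage and are made more likely in the next training update, while below-average answers get a negative one.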

Reinforcement learning is a popular technique used to make an AI model perform better on a certain task—you give the model a problem of some kind, then give it a positive or negative signal based on whether it generates the right answer or not, and then you repeat the process until the model is very good at performing the task correctly. DeepSeek’s big innovation on this front was to remove the need for human evaluators to provide that positive or negative signal, thus making the training process much more efficient.
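The automatic signal described above can be as simple as a program that checks the model's final answer against a known solution. The Python sketch below is a hypothetical illustration of that idea for math problems. The `Answer:` convention and the function name are assumptions made for this example, not DeepSeek's or Open-R1's actual reward code.

```python
import re

def correctness_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the last 'Answer: ...' line matches the expected answer, else 0.0."""
    # Collect everything after each 'Answer:' up to the end of that line.
    matches = re.findall(r"Answer:\s*(.+)", completion)
    if not matches:
        return 0.0  # an output with no parsable answer counts as wrong
    return 1.0 if matches[-1].strip() == expected.strip() else 0.0

# Invented completions for the problem "What is 2 + 2?"
good = correctness_reward("2 + 2 is 4.\nAnswer: 4", "4")
bad = correctness_reward("I think it is 5.\nAnswer: 5", "4")
```

Because the check is plain code, it can score thousands of model outputs without a person reading any of them.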

“We put together the training script for the community to immediately start playing with [GRPO], and we’ve already seen lots of very nice examples of people taking this code and then showing that, if you apply it to a whole range of different models, it actually works,” said Tunstall. “You can take a model like Llama and show and teach it how to do mathematics almost from scratch.”

Tunstall said he had already seen people taking models that are small enough to run in a browser and instilling so-called reasoning capabilities into them, using the script that Open-R1 has already released.

Now the project is working on creating datasets, mostly of math problems, that AI engineers can use to train new reasoning models with DeepSeek’s techniques. (As for DeepSeek’s own dataset, OpenAI has alleged that the Chinese company used o1’s outputs in contravention of its terms and conditions.) Other open-source projects, such as Open Thoughts, are trying to do the same thing.

“In the coming months we’re going to see this explosion of both datasets for reasoning and insights in how to actually train these models effectively, for multiple groups,” said Tunstall. “That to me is the exciting thing about open-source. It’s not a zero-sum thing. Our hope is that collectively we can decipher the secrets of R1.”

Open-source AI is a controversial subject, as such models tend to be less safe and secure than their proprietary, closed-source rivals. In fact, recent independent evaluations of DeepSeek’s R1 have found that the model’s guardrails can be easily overcome through common “jailbreaking” methods, which involve designing prompts that trick the model into ignoring its safety restrictions. Once the guardrails are bypassed, the model can generate potentially harmful responses, including possible malware and help with dangerous activities ranging from financial fraud to bioterrorism.

But, as Open-R1 is making clear, there’s little chance of going back. Meanwhile, open-source proponents argue that this collaborative approach could prove beneficial for the democratization and advancement of AI.
