On Sunday, OpenAI unveiled Deep Research, an agentic AI tool that can conduct multi-step research on the internet for complex tasks. The ChatGPT maker says the tool performs at the level of a human research analyst, claiming that what the agent accomplishes in ten minutes would take a person several hours.
And so far, the tool appears to be living up to the hype. According to benchmark results shared for Humanity's Last Exam, arguably the hardest AI exam, released less than two weeks ago, Deep Research holds a significant lead over OpenAI's own o3-mini and DeepSeek's R1 model, which is built on DeepSeek V3 (via TechRadar).
For context, Humanity's Last Exam was assembled by subject-matter experts around the world and features some of the most difficult questions ever posed to AI models. DeepSeek R1 previously held a significant lead over proprietary models with a 9.4% accuracy score.
However, the Chinese AI model was dethroned from the top spot following the launch of OpenAI's o3-mini, which posted a 10.5% accuracy score. Things got more interesting when the reasoning setting was bumped up to o3-mini-high, pushing the score to 13%. The gap between the two settings comes down to the latter spending more time analyzing and reasoning when presented with a complex query.
On the other hand, OpenAI's new Deep Research agentic AI tool scored 26.6% on Humanity's Last Exam, translating to a 183% increase in accuracy.
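For the curious, that 183% figure only checks out if the baseline is DeepSeek R1's 9.4% score rather than o3-mini-high's 13%:

$$\frac{26.6 - 9.4}{9.4} \approx 1.83 \quad\Longrightarrow\quad \text{roughly a } 183\% \text{ increase}$$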
Granted, the tool ships with built-in search capabilities that allow it to scour the web for answers to some of the general-knowledge questions featured in the test, ultimately giving it a competitive advantage over the other models in the running.
An OpenAI employee described his experience with Deep Research as "a personal AGI moment," writing:
"Using Deep Research has been a personal AGI moment for me. It takes 10 mins to generate accurate and thorough competitive and market research (with sources) that previously used to take me 3 hours."