Tom’s Guide
Ryan Morrison

I test AI for a living and Google’s Gemini Pro 1.5 is a true turning point

Google Gemini 1.5 logo.

Google dropped a surprise on the AI world on Thursday with the release of Gemini Pro 1.5, a new version of its recently released next-generation AI model Gemini Pro.

A company the size of Google announcing a new update isn’t exactly groundbreaking, but what is significant about Gemini Pro 1.5 is just how much better it is than version one, and how different.

Gemini Pro 1.5 has a significantly larger context window than any other model on the market: up to 10 million tokens compared to, for example, 128,000 tokens for GPT-4. It is also now technically more powerful than Gemini Pro 1.0 and Gemini Ultra 1.0, the model that powers Gemini Advanced, the paid-for version of the Gemini chatbot.

Google claims its new model is also more reliable, can pick out specific moments in long inputs, and works across video, audio, images and text natively. This is a big deal as AI moves into the real world through AR interfaces like the Meta Quest, Apple Vision Pro or Ray-Ban smart sunglasses.

What makes Gemini Pro 1.5 so different?


Gemini Pro 1.5 has a staggering 10 million token context length. That is the amount of content it can store in its memory for a single chat or response. 

This is enough for hours of video or multiple books within a single conversation, and Google says it can find any piece of information within that window with a high level of accuracy.

Jeff Dean, Google DeepMind's Chief Scientist, wrote on X that the model also comes with advanced multimodal capabilities across code, text, image, audio and video.

He wrote that this means you can “interact in sophisticated ways with entire books, very long document collections, codebases of hundreds of thousands of lines across hundreds of files, full movies, entire podcast series, and more.”

In “needle-in-a-haystack” testing, where a specific piece of information (the needle) is hidden in the vast amount of data held in the context window, Google's researchers found the model could retrieve it with 99.7% accuracy, even with 10 million tokens of data.
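To make the idea concrete, here is a minimal sketch of how a needle-in-a-haystack check could be set up. It is not Google's test harness: the filler sentences, the needle, the `toy_model` stand-in and the scoring rule are all assumptions made for illustration. In a real test, `toy_model` would be replaced by a call to the model being evaluated.

```python
import random


def build_haystack(filler_sentences, needle, depth, total_sentences):
    """Return a long text with `needle` inserted at a relative depth (0.0 to 1.0)."""
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sentences = [rng.choice(filler_sentences) for _ in range(total_sentences)]
    sentences.insert(int(depth * total_sentences), needle)
    return " ".join(sentences)


def toy_model(context, question):
    """Hypothetical stand-in for a real model call.

    It ignores the question and simply returns the first sentence that
    mentions the keyword, so it always finds the needle.
    """
    for sentence in context.split(". "):
        if "magic number" in sentence:
            return sentence
    return ""


def recall_rate(model, depths, total_sentences=1000):
    """Fraction of needle positions at which the model's answer contains the fact."""
    filler = ["The sky was grey that morning.", "Nobody noticed the time passing."]
    needle = "The magic number is 7421."
    hits = 0
    for depth in depths:
        context = build_haystack(filler, needle, depth, total_sentences)
        answer = model(context, "What is the magic number?")
        hits += "7421" in answer
    return hits / len(depths)


# Place the needle at 0%, 10%, ... 100% of the way through the haystack.
depths = [i / 10 for i in range(11)]
print(recall_rate(toy_model, depths))
```

The figure that matters is the fraction of positions at which the hidden fact comes back correctly, which is the kind of number the 99.7% claim refers to.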

What are the use cases for such a large context window?

While promoting the new model, Google showed off its video analysis ability on Buster Keaton’s 45-minute silent movie “Sherlock Jr.”

Sampled at one frame per second, the movie required a total of 648,000 tokens as input. Gemini could then answer questions about the movie, such as "tell me about the piece of paper removed from the person's pocket", and give the exact time code.
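A quick back-of-the-envelope calculation shows what those figures imply. The sketch below uses only the numbers quoted in this article; the per-frame cost is an implied average (the real total may include overhead that is not per-frame), and the hours estimate assumes the same cost holds across a full 10 million token window.

```python
# Figures quoted in the article.
MOVIE_MINUTES = 45
FRAMES_PER_SECOND = 1
TOTAL_TOKENS = 648_000
CONTEXT_WINDOW = 10_000_000

# One frame per second over 45 minutes.
frames = MOVIE_MINUTES * 60 * FRAMES_PER_SECOND

# Implied average token cost of a single frame.
tokens_per_frame = TOTAL_TOKENS / frames

# Assumption: the same per-frame cost applies to any video at this sampling rate.
hours_that_fit = CONTEXT_WINDOW / (tokens_per_frame * FRAMES_PER_SECOND * 3600)

print(frames)                    # 2700
print(tokens_per_frame)          # 240.0
print(round(hours_that_fit, 1))  # 11.6
```

On these numbers each sampled frame costs about 240 tokens, and a 10 million token window would hold roughly 11 and a half hours of video at one frame per second.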

Gemini was able to say exactly what was written on the piece of paper and the exact moment the paper is shown in full on screen. 

In another example, developers made a quick sketch of a scene, not particularly well drawn and using stick figure art. They gave it to Gemini and asked for the timestamp of that particular scene. Gemini returned the correct timestamp, accurate to the second.

Other surprising benefits of a large context


One other aspect that hasn’t been widely reported is the potential for saving, learning and even creating new languages. Jim Fan, a senior researcher and AI agent expert at Nvidia, pointed out the zero-shot ability of Gemini Pro 1.5 to understand linguistics in a surprising way.

He wrote on X: “v1.5 learns to translate from English to Kalamang purely in context, following a full linguistic manual at inference time.”

Kalamang is a language spoken by about 200 people in New Guinea, and Gemini had no information on the language during its original training, so it had nothing to draw from.

For the test, it was given 500 pages of linguistic documentation, a dictionary, and about 400 parallel sentences, along with the context in which they apply. It used this material to learn the language and offer translations of any words or phrases from English to Kalamang and back.
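In practice, "learning in context" means all of that reference material is placed in the prompt itself, followed by the translation request. The sketch below shows one way such a prompt could be assembled. The function name, section labels and layout are assumptions for illustration, not the format Google used, and the angle-bracket strings are placeholders rather than real Kalamang.

```python
def build_translation_prompt(grammar_text, dictionary, parallel_sentences, source_sentence):
    """Assemble one long prompt holding all reference material plus the request."""
    dictionary_lines = "\n".join(
        f"{english} = {kalamang}" for english, kalamang in dictionary.items()
    )
    example_lines = "\n".join(
        f"English: {english}\nKalamang: {kalamang}"
        for english, kalamang in parallel_sentences
    )
    parts = [
        "Grammar reference:\n" + grammar_text,
        "Dictionary:\n" + dictionary_lines,
        "Parallel sentences:\n" + example_lines,
        f"Translate from English to Kalamang.\nEnglish: {source_sentence}\nKalamang:",
    ]
    return "\n\n".join(parts)


# Placeholder inputs only; the real material would be hundreds of pages long.
prompt = build_translation_prompt(
    grammar_text="<grammar book text>",
    dictionary={"<english word>": "<kalamang word>"},
    parallel_sentences=[("<english sentence>", "<kalamang sentence>")],
    source_sentence="<sentence to translate>",
)
print(prompt)
```

Nothing about the model changes in this setup. The whole "lesson" lives in the prompt, which is why the size of the context window matters so much.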

When will I be able to use Gemini Pro 1.5?

Gemini Pro 1.5 is already available to some enterprise customers using Vertex AI or Google Cloud’s Generative AI studio. At some point it will come to the Gemini chatbot, but when it does the maximum context will likely be closer to 128,000 tokens, similar to ChatGPT Plus.

This is a game-changing moment for the AI sector, on a day that also saw OpenAI launch a video model and Meta find a way to use video to teach AI about the real world.

What we are seeing with these advanced multimodal models is the interaction of the digital and the real, where AI is gaining a deeper understanding of humanity and how we see the world.
