A YouTuber has recreated Google’s misleading Gemini Ultra demo video, achieving real-time responses to changes in a live video feed using OpenAI’s vision AI model GPT-4V.
Google unveiled its impressive-sounding Gemini artificial intelligence models last week, including the flagship Gemini Ultra, alongside a video that appeared to show the model responding in real time to changes in a live video feed. The problem is, Google faked it.
In reality, Gemini Ultra did solve the problems shown in the promotional clip, but it did so from still images and over a longer period of time.
To find out whether it is even possible to have an AI play the find-the-ball game, identify locations on a map or spot changes in a drawing as you make it, YouTuber Greg Technology built a simple app to test how well GPT-4V handles the same tasks.
So what exactly happened with Gemini?
Gemini Ultra was trained to be multimodal from the ground up, meaning its dataset included images, text, code, video, audio and even motion data. This allows it to have a broader understanding of the world and to see it “as humans do”.
To demonstrate these capabilities, Google released a video in which various actions were performed on camera while a Gemini voiceover described what it could see.
In the video, it seems like this is all happening live, with Gemini responding to changes as they happen, but that isn’t exactly the case. While the responses are real, they were generated from still images or short segments rather than in real time. Put simply, the video was more a marketing exercise than a technical demo.
So OpenAI’s GPT-4 can already do this?
In a short two-minute video, Greg, who makes demos of new technology for his channel, explained that he was excited by the Gemini demo but disappointed to find it wasn’t real time.
“When I saw that I thought, that is kind of strange, as GPT-4 vision, which came out a month ago, has been doing what is in the demo, only it is real,” he said.
The conversation with GPT-4 is similar to ChatGPT’s Voice mode, with responses delivered in the same natural tone. The difference is that this version included video, with the OpenAI model responding to hand gestures, identifying a drawing of a duck on water and playing rock, paper, scissors.
Greg Technology has released the code behind the ChatGPT video interface shown in the demo on GitHub, so others can try it out for themselves.
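To give a sense of how an app like this can work, here is a minimal Python sketch of the general idea rather than the actual code from the repository. It assumes the opencv-python and openai packages, an OPENAI_API_KEY environment variable, and OpenAI’s vision-capable model (gpt-4-vision-preview at the time of writing): grab a webcam frame, encode it, and ask the model what it sees.

```python
# Minimal sketch (not the repository's actual code): capture one webcam frame
# and ask GPT-4 with vision to describe it.
import base64

import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Capture a single frame from the default webcam
cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()
if not ok:
    raise RuntimeError("Could not read a frame from the webcam")

# Encode the frame as a base64 JPEG data URL
_, jpeg = cv2.imencode(".jpg", frame)
data_url = "data:image/jpeg;base64," + base64.b64encode(jpeg.tobytes()).decode()

# Send the frame to the vision-capable GPT-4 model
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model name, current when this was written
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "In one short sentence, what do you see?"},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
```

Broadly speaking, looping that capture-and-ask step every second or two and reading the replies aloud with text-to-speech is enough to give a demo like this its “live” feel, since the model only ever sees individual frames.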
Trying out the GPT-4 Vision code
I installed the code produced by Greg Technology on my Apple MacBook Air M2 and paired it with my OpenAI API key for GPT-4V to see whether the demo really worked and wasn’t another “fake demo”.
After a few minutes I had it installed and running, and it worked perfectly, happily identifying hand gestures, my glass coffee cup and a book, even telling me the book’s title and author.
What this shows is just how far ahead of the pack OpenAI is, especially in terms of multimodal support. While other models can now analyze the contents of an image, they’d struggle with real-time video analysis.