Just seven months after it unveiled its Veo AI video generator, Alphabet division Google DeepMind has announced Veo 2.
The new tool can generate videos at up to 4K resolution, whereas the first Veo topped out at 1080p. Google is claiming improvements in the physics of the scenes that the upgraded Veo generates, as well as better "camera control" (there is no real camera involved, but the user can prompt the model for specific camera shots and angles, from close-ups to pans to "establishing shots").
DeepMind also announced an updated version of its Imagen 3 text-to-image model, though the changes—like “more compositionally balanced” images and improved adherence to artistic styles—clearly aren’t big enough to warrant a full new version number. Imagen 3 first rolled out in August.
Veo 2's step up to 4K puts DeepMind ahead of rival AI labs on at least one measure of video generation: resolution.
OpenAI finally released its Sora video generator a week ago, after having unveiled it all the way back in February, but the output of Sora (specifically, the Sora Turbo version that is now available to ChatGPT Plus and Pro users) remains limited to a maximum resolution of 1080p. Runway, which is perhaps the most popular of the current AI video generators, can only export at an even fuzzier 720p.
“Low resolution video is great for mobile, but creators want to see their work shine on the big screen,” Google said in a presentation on Veo 2.
Veo 2’s 4K clips are limited to eight seconds by default, but they can be extended to two minutes or more, said a Google spokesperson. Sora’s 1080p clips are capped at 20 seconds.
DeepMind claims that when human raters compared Veo 2 to Sora Turbo, 59% preferred Google's service, while 27% opted for Sora Turbo. It claims similar victories against MiniMax and Meta's Movie Gen, with preference for Veo 2 dipping just below 50% only when the rival was Kling v1.5, a service from China's Kuaishou Technology.
When it comes to “prompt adherence”—i.e. doing what it was asked to do—Veo 2 was preferred at similar rates, according to DeepMind.
The Google unit also claims to have made significant strides in combating “hallucinated” details, like bonus fingers, and in demonstrating “a better understanding of real-world physics and the nuances of human movement and expression.”
The physics issue is one that continues to bedevil video generators. Sora, for example, struggles to generate plausible footage of gymnasts and their complex movements. It remains to be seen how much better Veo 2 will prove in this regard.
Some, like Stanford professor and World Labs co-founder Fei-Fei Li, argue that issues like physics and object permanence can only really be solved with so-called world models that have the "spatial intelligence" to understand and generate 3D environments. Google unveiled its own Genie 2 world model earlier this month, though its focus is on generating interactive environments that can be used to train and evaluate AI "agents."
The more plausible the output of image and video generators, the greater the risk of them being used for nefarious purposes. DeepMind applies invisible SynthID watermarks to Veo 2 clips, which should make it more difficult to use them for political disinformation, provided people actually check videos for such telltale signs of AI origins. That safeguard may count for less in more mundane fraud, where victims are less likely to check a file for invisible watermarks.
By way of contrast, OpenAI's Sora embeds a visible animated watermark in the bottom-right corner of its videos. Sora also attaches provenance metadata under the open C2PA standard, an alternative approach to SynthID's invisible watermarks (though Google also joined the C2PA initiative in February).
Veo 2 is now powering Google Labs' VideoFX generation tool (which has a resolution cap of 720p), while the revised Imagen 3 model can now be used in the ImageFX tool. VideoFX is currently rolling out in the U.S. only, but ImageFX is available in over 100 countries.
Google DeepMind has not said what data was used to train Veo 2 or the new version of Imagen 3, though it has previously hinted that YouTube videos (YouTube, like DeepMind, falls under the Alphabet umbrella) formed part of the training data for the original Veo.
Many artists, photographers, creators and filmmakers are concerned that their copyrighted works have been used to train such systems without their consent. OpenAI has refused to say what data was used to train Sora, but the New York Times, citing sources familiar with Sora's training, has reported that the company used videos from Google's YouTube service to train the model. 404 Media has previously reported that Runway also appears to have used YouTube videos to train its Gen-3 Alpha model.
ImageFX is not available in Germany, where this writer is based. However, a Google DeepMind spokesperson denied that this had anything to do with the EU's new AI Act, which requires makers of general-purpose AI models to publish a detailed summary of the copyright-protected data used to train them. "We often ramp up experiments in one or limited markets before expanding more broadly," the spokesperson said.