What you need to know
- With Google I/O 2024 ongoing, the company highlights several AI updates, such as a new Gemini model known as 1.5 Flash.
- Gemini Nano on Android is getting an upgrade and will soon be able to handle image input from users as it gains multimodality.
- The company also unveiled Veo, a video generation model, and Imagen 3, which it touts as its "highest quality" text-to-image generator.
With Google's I/O 2024 event underway, the company highlights several ways its AI model Gemini is set to improve, including a new member of the model family and AI agents.
As detailed in a blog post, Google is welcoming a new member to its Gemini ecosystem known as "Gemini 1.5 Flash." Following the launch of 1.5 Pro in February, the company states it quickly became apparent that its applications needed "lower latency and lower cost to serve." The 1.5 Flash model is said to be lighter than its Pro sibling and, as a result, faster and more efficient.
Google states Flash can handle "high-volume, high-frequency tasks at scale." The company adds that Flash can take care of large chunks of data and deliver quality that exceeds what its size would suggest. According to Google, the 1.5 Flash model excels at summarization, chat applications, image/video captioning, data extraction from long documents and tables, and more.
Flash joins Pro in public preview with a 1 million token context window in AI Studio and Vertex AI. For developers using the API and for Google Cloud customers, the window has increased to 2 million tokens.
Speaking of the 1.5 Pro model, Google says it has continued to plug away at updates, with the most recent one set to improve the AI's reasoning and coding. Google also cites the model's results on image and video understanding benchmarks, namely MMMU, AI2D, MathVista, ChartQA, DocVQA, InfographicVQA, and EgoSchema.
The next-gen Pro model has been upgraded to follow "increasingly complex" and "nuanced" instructions. Google states users can even specify product-level behavior, such as role, format, and style, for the 1.5 Pro model. Audio understanding arrives in the Gemini API and AI Studio, meaning the 1.5 Pro model can reason across the images and audio of videos uploaded to the latter.
Google adds that there are plans to integrate the 1.5 Pro model into Gemini Advanced and Workspace apps. During I/O, it was teased that the upgraded model would arrive for Gmail and NotebookLM. Google briefly demonstrated 1.5 Pro's multimodal capabilities in its recent AI-powered note-taking app. It showed that users can have the AI simulate a conversation in which it delivers larger pieces of information in a more digestible way.
More importantly, users can chime in, ask the AI questions, and have it provide relevant responses. Additionally, if your child is eager to learn a new topic, Gemini 1.5 Pro can deliver age-appropriate responses if required.
Updates for Nano
Google's Gemini Nano model for Android is getting some updates. The post states that the model is moving beyond text inputs and will soon start accepting images from users. Pixels are first in line. As Google puts it, Nano with multimodality will begin to "understand the world the way people do," taking in sight, sound, and spoken language as well as text input.
Gemma, the open-source model that leverages the same tech behind the Gemini models, is receiving version 2. The company highlights Gemma 2's revamped architecture for "breakthrough" performance and efficiency, alongside new sizes. The Gemma family will also soon pick up PaliGemma, a vision-language model inspired by PaLI-3.
Google is taking its AI agents seriously as part of DeepMind's mission to build AI responsibly and make it helpful for the everyday user. The new universal AI agents the company has in store are designed to process information faster by encoding video frames. Google adds that agents can combine speech and video to create a timeline of events and capture information relevant for recall.
DeepMind's work extends to using Google's speech models to help agents understand context and respond more quickly during conversations.
A new wave of video and image generation
The other side of this substantial Gemini I/O 2024 update is Google's Veo and the new Imagen 3. Veo is said to have an "advanced understanding of natural language and visual semantics..." The video generation model is also said to be able to generate visuals that closely match the user's original idea.
Longer prompt understanding and capturing the right tone are also said to be in its wheelhouse.
Veo reportedly builds on Google's other video generation work like Generative Query Network (GQN), DVD-GAN, Imagen-Video, Phenaki, WALT, VideoPoet, and Lumiere. Google states that, beginning today (May 14), Veo is available to "select" creators as a private preview through VideoFX. Creators must sign up for a waitlist.
Google adds that it also plans to bring Veo to YouTube Shorts and "other products."
Imagen 3's debut sees Google calling the model its "highest quality text-to-image model." Higher levels of detail, photorealism, lifelike images, and more are said to come from Imagen 3's capabilities. The text-to-image model can understand natural language and the intent behind your prompt a little better than its predecessor. Imagen 3 is also said to pick up on the smaller details users might include in longer prompts.
The next iteration of Google's text-to-image model is available today (May 14) for "select" creators through ImageFX as a private preview. Users can sign up to join its waitlist. Google teases Imagen 3's upcoming integration with Vertex AI, similar to Imagen 2.