Rumors that OpenAI has been working on something major have been ramping up over the last few weeks, and CEO Sam Altman himself has taken to X (formerly Twitter) to confirm that it won’t be GPT-5 (the next iteration of its breakthrough series of large language models) or a search engine to rival Google. What a new report, the latest in this saga, suggests is that OpenAI might be about to debut a more advanced AI model with built-in audio and visual processing.
This AI model could be revealed imminently in the company's 'Spring Update' event, which kicks off soon at 10am PT / 1pm ET / 6pm BST – here's how you can tune in to watch OpenAI's event.
OpenAI is towards the front of the AI race, striving to be the first to realize a software tool that comes as close as possible to communicating in a similar way to humans, being able to talk to us using sound as well as text, and also capable of recognizing images and objects.
The report detailing this purported new model comes from The Information, which spoke to two anonymous sources who have apparently been shown some of these new capabilities. They claim that the incoming model has better logical reasoning than those currently available to the public, being able to convert text to speech. None of this is new for OpenAI as such, but what is new is all this functionality being unified in the rumored multimodal model.
A multimodal model is one that can understand and generate information across multiple modalities, such as text, images, audio, and video. GPT-4 is also a multimodal model that can process and produce text and images, and this new model would theoretically add audio to its list of capabilities, as well as a better understanding of images and faster processing times.
The bigger picture that OpenAI has in mind
The Information describes Altman’s vision for OpenAI’s products in the future as involving the development of a highly responsive AI that performs like the fictional AI in the film “Her.” Altman envisions digital AI assistants with visual and audio abilities capable of achieving things that aren’t possible yet, and with the kind of responsiveness that would enable such assistants to serve as tutors for students, for example. Or the ultimate navigational and travel assistant that can give people the most relevant and helpful information about their surroundings or current situation in an instant.
The tech could also be used to enhance existing voice assistants like Apple’s Siri, and usher in better AI-powered customer service agents capable of detecting when a person they’re talking to is being sarcastic, for example.
According to those who have experience with the new model, OpenAI will make it available to paying subscribers, although it’s not known exactly when. Apparently, OpenAI has plans to incorporate the new features into the free version of its chatbot, ChatGPT, eventually.
OpenAI is also reportedly working on making the new model cheaper to run than its most advanced model available now, GPT-4 Turbo. The new model is said to outperform GPT-4 Turbo when it comes to answering many types of queries, but apparently it’s still prone to hallucinations, a common problem with models such as these.
The company is holding an event today at 10am PT / 1pm ET / 6pm BST (or 3am AEST on Tuesday, May 14, in Australia), where OpenAI could preview this advanced model. If this happens, it would put a lot of pressure on one of OpenAI’s biggest competitors, Google.
Google is holding its own annual developer conference, I/O 2024, on May 14, and a major announcement like this could steal a lot of thunder from whatever Google has to reveal, especially when it comes to Google’s AI endeavor, Gemini.