Hume AI, the empathetic AI voice company, has just unveiled OCTAVE (Omni-Capable Text And Voice Engine). This new model combines the capabilities of the EVI 2 speech-language model with advanced emotional expression and voice cloning functionality, all in an extremely small form factor.
OCTAVE can take a prompt or brief recording, and generate not just words but also expressive emotions, dialects, and other components of a full personality.
The product can also understand a wide range of prompt instructions. For example, the user can request a ‘gentle therapist’ or an ‘excitable salesman’, and the model will output the required response with minimal latency.
A key feature of the new model is that it can do all this on the fly, so users can instantly create emotive, expressive characters just by uploading a short five-second audio sample or entering a prompt which sets the scene.
This is a big improvement over most current voice models, which generally only offer a limited set of 'personality' voice types, and opens up the field to much more engaging AI interactions with zero training.
The demo launch clips demonstrate a wide range of prompt variation, including different accents and temperaments, as well as vague terms such as ‘favorite uncle’ or a ‘voice that barrels through conversations like rush hour traffic.’
The results are good, although it has to be said they’re not perfect. It’s still possible to detect artifacts bleeding through the audio at times, especially with the more exotic prompt demands.
What’s slightly more impressive, and actually quite worrying, is the model’s ability to clone a voice. Again, there are demos on the launch page which show a striking ability to mimic voices, including those of Ilya Sutskever and Hume’s own Lauren Kim.
Having cloned a voice, the model can then use it in real-time conversation. It's not hard to imagine how this could be abused in the future.
Potential use cases
This multi-voice capability can be used to create instant podcasts which integrate live human chat with a cloned voice of choice. Imagine setting up a podcast with you in conversation with Barack Obama or John Wayne, without any lengthy, tedious training.
It even extends to cloning multiple characters, such as taking an existing podcast from Google’s NotebookLM and using it to generate a completely new conversation on the fly.
With a nod towards edge devices such as smartphones, OCTAVE is a modest model, featuring just 3B parameters. The implication is that the new product will give voice to many more small devices, and maybe even consumer appliances, opening up a whole new universe of interactive possibilities.
The product is currently available only to a select number of trusted testers, to ensure safety and functionality, but the company plans a wider rollout over the next few months.