Microsoft unveiled its new version of the Copilot app last week and with it a new "Voice" mode that works the same way as OpenAI’s ChatGPT Advanced Voice. It lets you talk to the AI as if it were a human and, unlike Advanced Voice, doesn’t require a $20-per-month subscription.
When Voice mode first launched, there was some speculation over what technology Microsoft was using for Copilot Voice, as it seemed remarkably similar to Inflection’s Pi. This made some sense as the founder and former CEO of Inflection, Mustafa Suleyman, is now the CEO of Microsoft AI and in charge of Copilot.
I’ve since confirmed that, like all previous versions of Microsoft Copilot, it is using a modified version of the OpenAI models that also power ChatGPT. Under the hood of Copilot Voice is the same GPT-4o model that powers ChatGPT Advanced Voice.
The difference between ChatGPT Advanced Voice and Copilot is that Microsoft is giving everyone Advanced Voice-like technology for free.
I decided to see just how alike, or not, these two voice assistants were by making them talk to each other. I’ve had limited success getting AIs to converse before and found Google Gemini Live flat-out refuses to listen to another AI voice, so I wasn’t sure what to expect.
How do Advanced Voice and Copilot compare?
Essentially, Copilot Voice and Advanced Voice are siblings. They share the same underlying model but have been given slightly different personalities, voices, and guardrails.
Microsoft says it has worked hard to fine-tune GPT-4o and the voice layer to respond more naturally. When I’ve used it, Copilot Voice does sound more humanlike than Advanced Voice, even going so far as to shorten words and use slang terms more liberally than the OpenAI product.
Unlike Google Gemini Live or similar models, including Meta’s new Meta AI Voice, ChatGPT Advanced Voice and Copilot Voice are both native speech-to-speech. That means they understand the sounds we make without first transcribing them to text.
This means they can pick up on nuances and tone changes. It also allows them to be more emotive: not only are they picking up on what we say and how we sound, but they also respond directly with sound, so they can adapt the tone of their voices and accents to our speech patterns. It also means they can easily be interrupted or even interrupt you (although neither has that feature yet).
How did the conversation progress?
For my experiment, I had an iPhone 14 Pro Max running ChatGPT Advanced Voice and an iPhone 15 Pro running Copilot Voice. I put them both side-by-side and started filming their conversation.
I used voices with an English accent for both. From Advanced Voice, I picked the Arbor voice and had it adapt itself to sound a bit more Yorkshire, like a Yorkshireman who has lived down south most of his life. From Copilot, I picked Wave but had it speak faster and deeper.
I started them both up at the same time and said “ChatGPT, say hello to Copilot.” It got weird straight away. They immediately began talking over each other. Copilot was the first to speak with “I can’t exactly do that,” quickly interrupted by ChatGPT saying “Hi, Copilot.” This prompted a sarcastic-sounding “Hi, Ryan” from Copilot, which had got the wrong end of the stick.
I tried to say "Copilot, that was ChatGPT talking to you," and they both started a chorus of "so, um, sounds good" until ChatGPT hit pay dirt with "What's next on the agenda?" during a rare silence. This was exactly the right thing to say, as Copilot went into a list of potential talking points.
After a bit of sibling squabbling, talking over each other and some odd noises, they finally settled into a routine when ChatGPT "gave way" to Copilot. It sometimes felt like listening to two Englishmen trying to make small talk and decide who should speak first. All that was missing was the “after you” and “you first”.
Once they finally settled into their routine, we got a fascinating back-and-forth over the value of nostalgia and what makes it so powerful, although it was a bit of a "battle of the sentimentalists." You can see what I mean in the embedded video above.