OpenAI has announced its “most advanced speech-to-speech model” and recent updates to its Realtime API, allowing users to build production-ready voice agents.
OpenAI is certainly no stranger to experimenting with voice capabilities, especially with its ChatGPT product, but this particular enhancement, with improvements to audio quality, intelligence, and instruction following, seeks to provide an overall sleeker and more natural experience.
What Is gpt-realtime?
The latest voice model in the AI company’s suite is known as gpt-realtime, and OpenAI has said that it has been trained extensively with customers to provide better responses for scenarios such as customer support, personal assistance, and education. It claims that the voice capabilities of this new model are better at following more complex instructions, calling appropriate tools at the right time, and “producing more natural-sounding, expressive” speech.
So far, some users have said that the new voices flowed naturally and sounded empathetic, which is likely to be great for some and not for others. Research from the Harvard Business Review at the beginning of the year detailed research that determined that consumers don’t actually want AI to sound human, but different research from Texas Christian University concluded that it depends on the user and the use case.
OpenAI’s new model has also been designed to align with the way developers build and deploy voice agents, with a particular focus on action calling and how that functions. “Function calling”, as OpenAI refers to it, has improved in terms of calling relevant functions, calling functions at the appropriate time, and calling functions with appropriate arguments (resulting in higher accuracy).
Updates to Realtime API
The company’s Realtime API – an API designed to help developers build live, low-latency AI experiences in real time – is now generally available and supports remote MCP servers, image inputs, and phone calling through Session Initiation Protocol (SIP), making voice agents more capable through access to additional context and tools.
The image update is especially interesting, as now the model can ground the conversation in what the user is seeing, and the user can ask specific questions based on the images.
Additionally, the new ability to connect apps to the public phone network PBX systems, desk phones, and other SIP endpoints with direct support likely presents an interesting opportunity for many customer-facing businesses.
The Voice Agent Race Continues
As this move from OpenAI solidifies its place in the sphere of AI voice agents, it also poses the question of who will make the next move.
Earlier this year, Salesforce teased the upcoming voice capabilities for its proprietary AI Agentforce, and Salesforce’s ex-CEO, Bret Taylor, has launched his own AI voice startup geared towards a more genuine and empathetic AI agent approach, equipped with voice capabilities that allow the agent to speak naturally, understand acronyms, take actions, and more.
The market for AI voice agents could be much vaster than anticipated, and it will be interesting to see how customers and users respond to the different options, as well as whether a more human-sounding AI is something that they actually strive towards or not.
Summary
OpenAI’s new AI voice model is the purposeful next step of the company’s venture into more seamless, human-like AI agents. The kinds of results that this model can produce will be something to watch, especially if other companies begin enhancing or creating their very own voice agents.