
OpenAI Unveils New Voice AI Models with Custom Voices

Image: OpenAI CEO Sam Altman (credit: Getty Images)

Despite past controversies around voice imitation, OpenAI is doubling down on voice AI. The company unveiled three new voice models this week—gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-mini-tts—designed to improve audio transcription and text-to-speech experiences for developers and users.

Initially available through OpenAI’s API, these models give third-party developers the tools to build voice-driven apps with improved accuracy and customization. OpenAI also launched a demo site, OpenAI.fm, where users can experiment with the new voices in creative ways.
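For developers, the new transcription models slot into the same audio endpoint OpenAI already exposes. A minimal sketch with the official Python SDK, assuming an OPENAI_API_KEY in the environment (the file name is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local recording with the new flagship transcription model.
with open("meeting.wav", "rb") as audio_file:  # hypothetical file
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```

Swapping in gpt-4o-mini-transcribe as the model name trades some accuracy for lower cost and latency.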

AI Voices You Can Customize — Accents, Emotions, and More

What sets gpt-4o-mini-tts apart is the ability to customize voice output through text prompts. Users can tweak accent, pitch, tone, and even emotional delivery—from a calm yoga instructor to a dramatic villain.
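In the API, that steerability is exposed as a plain-text instructions field on the speech endpoint. A minimal sketch with the official Python SDK (the voice name, prompt, and output path are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Generate speech whose delivery is steered by a natural-language prompt.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the built-in voices
    input="Take a deep breath in... and slowly let it go.",
    instructions="Speak as a calm, soothing yoga instructor.",  # steers tone and emotion
) as response:
    response.stream_to_file("yoga.mp3")  # hypothetical output path
```

Changing only the instructions string, not the input text, is what shifts the delivery from calm instructor to dramatic villain.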

OpenAI says this level of control is designed to address concerns about voice cloning, offering flexibility while avoiding unintentional imitation of real individuals—a problem that landed the company in hot water with Scarlett Johansson last year.

“Now, it’s the user who decides how the AI should sound,” explained OpenAI’s Jeff Harris during a demo with VentureBeat.

Built on GPT-4o, But Trained for Speech and Transcription

The new models are refined versions of OpenAI’s GPT-4o, post-trained with extra data for transcription and speech generation. OpenAI says this update improves accuracy, noise resilience, and the ability to handle diverse accents and speech speeds across 100+ languages.

According to Harris, the new family is not designed for speaker diarization (separating individual speakers) but focuses on capturing audio as a single stream and responding naturally.

OpenAI claims the gpt-4o-transcribe model achieves a 2.46% word error rate in English, significantly outperforming its older Whisper model.

Thanks to new features like noise cancellation and semantic voice activity detection, the models better understand when speakers finish a thought, improving flow and accuracy.
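Both features are toggled through session configuration on OpenAI's Realtime transcription endpoint. The sketch below shows the shape of that configuration as documented at launch; the endpoint URL and field names are assumptions and should be checked against current docs:

```python
import json

import websocket  # pip install websocket-client

# Open a realtime transcription session (endpoint per OpenAI's launch docs; assumed).
ws = websocket.create_connection(
    "wss://api.openai.com/v1/realtime?intent=transcription",
    header=[
        "Authorization: Bearer YOUR_API_KEY",  # placeholder key
        "OpenAI-Beta: realtime=v1",
    ],
)

# Enable semantic voice activity detection and input noise reduction.
ws.send(json.dumps({
    "type": "transcription_session.update",
    "session": {
        "input_audio_transcription": {"model": "gpt-4o-transcribe"},
        "turn_detection": {"type": "semantic_vad"},  # detects end of a thought, not just silence
        "input_audio_noise_reduction": {"type": "near_field"},
    },
}))
```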

OpenAI is also introducing streaming speech-to-text, enabling continuous audio input and real-time transcription—a feature designed for more natural, low-latency conversations.
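For pre-recorded audio, the transcription endpoint also accepts a stream flag and emits incremental text events as the transcript is produced. A sketch, with event type names per OpenAI's docs and an illustrative file name:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Stream the transcript back incrementally instead of waiting for the full text.
with open("call.wav", "rb") as audio_file:  # hypothetical recording
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
        stream=True,
    )
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
```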

Developers can now integrate voice interactions into their apps with just nine lines of code using OpenAI’s new Agents SDK. This unlocks use cases like AI customer service, meeting transcription, or voice-driven e-commerce assistants.
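The SDK's voice support wraps speech-to-text, the agent loop, and text-to-speech into a single pipeline. A sketch based on the Agents SDK voice quickstart (the silent input buffer stands in for real microphone audio):

```python
import asyncio

import numpy as np
from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

agent = Agent(
    name="Assistant",
    instructions="You are a helpful voice assistant.",
)

async def main() -> None:
    # Speech-to-text -> agent -> text-to-speech, handled by one pipeline.
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))
    # Three seconds of silent 24 kHz mono PCM stands in for captured audio.
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # send event.data to the speakers here

asyncio.run(main())
```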

“For the first time, developers can build fluid voice experiences quickly and easily,” said Harris.

Pricing for the new models starts at $0.003 per minute for gpt-4o-mini-transcribe and $0.015 per minute for text-to-speech with gpt-4o-mini-tts—making them competitive against rivals like ElevenLabs and Hume AI.

Fierce Competition in AI Speech Tech Heats Up

OpenAI’s latest launch enters a market crowded with fast-evolving AI speech models. ElevenLabs recently introduced Scribe, which supports speaker diarization and boasts a 3.3% English error rate.

Hume AI's new Octave TTS takes things further, letting users control word-level emotion and pronunciation, entirely through text prompts. Meanwhile, Orpheus 3B, an open-source competitor, offers similar audio capabilities under an Apache 2.0 license.

Companies like EliseAI and Decagon have already integrated OpenAI’s new audio models. EliseAI reported that emotionally rich voices boosted tenant engagement during property management calls.

Decagon saw a 30% improvement in transcription accuracy, even in noisy environments, a gain that carried over directly to the real-world performance of its AI agents.

Not everyone welcomed the release. Ben Hylak, co-founder of Dawn AI and former Apple designer, criticized the update on X, suggesting it signals a shift away from real-time AI voice—a key strength of ChatGPT.

Adding to the drama, details leaked early on social media, with the @testingcatalog account sharing the model names ahead of the official announcement.

Still, OpenAI remains focused on refining voice capabilities, balancing customization and ethical AI use. The company also teased future investments in multimodal AI, including video, hinting at more dynamic, interactive agents on the horizon.
