OpenAI Introduces New Speech Models Through API

OpenAI Unveils Advanced Speech-to-Text and Text-to-Speech Models
OpenAI has launched new speech-to-text and text-to-speech models through its application programming interface (API). These models let developers build voice agents that are more sensitive to tone and expression, making interactions more natural and engaging.
Over the past few months, OpenAI has also introduced several features, such as Operator, Deep Research, Computer-Using Agents, and the Responses API, but those tools focused primarily on text-based agents. As Neowin reports, the latest audio models let developers build customizable, powerful, and expressive speech agents.
Improvements in Audio Processing
The newly launched models, gpt-4o-transcribe and gpt-4o-mini-transcribe, deliver lower word error rates and better language recognition and overall accuracy than the existing Whisper models. OpenAI attributes these gains to reinforcement learning techniques and extensive mid-training on diverse, high-quality audio datasets.
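As a rough illustration of how a developer might send audio to one of these transcription models, here is a standard-library-only sketch. The endpoint path and form-field names follow OpenAI's audio transcription API; the file name and API key are placeholders, and the exact request shape should be checked against OpenAI's current documentation.

```python
# Hedged sketch: building a speech-to-text request with only the Python
# standard library. Field names follow OpenAI's transcription endpoint;
# "meeting.wav" and the API key are placeholders.
import json
import os
import urllib.request
import uuid

API_URL = "https://api.openai.com/v1/audio/transcriptions"

def build_transcription_request(audio_bytes: bytes, filename: str,
                                model: str, api_key: str) -> urllib.request.Request:
    """Assemble a multipart/form-data POST for the transcription endpoint."""
    boundary = uuid.uuid4().hex
    # "model" form field selects gpt-4o-transcribe or gpt-4o-mini-transcribe.
    model_field = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    # "file" form field carries the raw audio bytes.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    body = model_field + file_header + audio_bytes + f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )

if __name__ == "__main__":
    # Actually sending the request requires a real key and audio file.
    with open("meeting.wav", "rb") as f:
        req = build_transcription_request(
            f.read(), "meeting.wav", "gpt-4o-transcribe",
            os.environ["OPENAI_API_KEY"])
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["text"])  # the transcribed text
```

In practice most developers would use OpenAI's official SDK rather than hand-built multipart bodies; the sketch only makes the request's moving parts visible.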
Understanding Nuances in Speech
OpenAI asserts that the new audio models excel at comprehending the subtleties and nuances of spoken language. They make fewer speech recognition errors and produce reliable transcriptions even for audio with accents, background noise, or varying speaking speeds. This is particularly beneficial for applications that need accurate voice recognition across diverse environments.
On the text-to-speech side, the gpt-4o-mini-tts model represents a significant improvement. It allows developers to give explicit instructions on how text should be spoken, not just what should be said, enhancing the overall user experience. Currently, however, the model is limited to artificial, preset voices.
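A minimal sketch of such a call might look like the following: the JSON body carries the text plus an "instructions" field describing delivery. The endpoint and field names follow OpenAI's speech API; the voice name, example text, and API key are placeholder assumptions.

```python
# Hedged sketch of a text-to-speech request using only the standard library.
# The "instructions" field is what distinguishes gpt-4o-mini-tts: it tells
# the model how to speak, not just what to say.
import json
import os
import urllib.request

def build_speech_request(text: str, instructions: str, api_key: str,
                         voice: str = "coral") -> urllib.request.Request:
    """Build a POST asking gpt-4o-mini-tts to speak `text` as instructed."""
    payload = {
        "model": "gpt-4o-mini-tts",
        "voice": voice,                # must be one of the preset voices
        "input": text,
        "instructions": instructions,  # e.g. pacing, tone, emphasis hints
    }
    return urllib.request.Request(
        "https://api.openai.com/v1/audio/speech",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_speech_request(
        "Your order has shipped.",
        "Speak warmly and slightly slower than normal.",
        os.environ["OPENAI_API_KEY"])
    with urllib.request.urlopen(req) as resp:
        with open("shipped.mp3", "wb") as out:  # response body is audio
            out.write(resp.read())
```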
Pricing Structure for New Models
OpenAI has announced the pricing for these advanced models as follows:
- gpt-4o-transcribe: $6 per million audio input tokens, $2.50 per million text input tokens, and $10 per million text output tokens.
- gpt-4o-mini-transcribe: $3, $1.25, and $5 per million tokens, respectively.
- gpt-4o-mini-tts: $0.60 per million text input tokens and $12 per million audio output tokens.
For an estimated cost per minute, the breakdown is:
- gpt-4o-transcribe: approximately 0.6 cents per minute
- gpt-4o-mini-transcribe: approximately 0.3 cents per minute
- gpt-4o-mini-tts: approximately 1.5 cents per minute
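The per-minute estimates follow directly from the token prices once you assume a rough token rate. The figures of about 1,000 audio tokens per minute for transcription and about 1,250 for TTS output are inferred here from the article's own numbers, not official values:

```python
def cents_per_minute(dollars_per_million_tokens: float,
                     tokens_per_minute: float) -> float:
    """Convert a per-million-token price into cents per minute of audio."""
    dollars_per_token = dollars_per_million_tokens / 1_000_000
    return dollars_per_token * tokens_per_minute * 100  # dollars -> cents

# Assumed token rates reproduce the article's estimates
# (audio-token cost only; text tokens add a little more):
print(round(cents_per_minute(6.0, 1_000), 4))   # gpt-4o-transcribe      -> 0.6
print(round(cents_per_minute(3.0, 1_000), 4))   # gpt-4o-mini-transcribe -> 0.3
print(round(cents_per_minute(12.0, 1_250), 4))  # gpt-4o-mini-tts output -> 1.5
```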
The OpenAI team says it will keep improving the intelligence and accuracy of these audio models and plans to explore ways for developers to bring custom voices, creating more tailored user experiences while adhering to OpenAI’s safety standards.
API Access and Integration
These audio models are now accessible to all developers through the OpenAI APIs. Additionally, OpenAI has announced a new integration with the Agents SDK to simplify the process of building speech agents. For real-time applications, OpenAI recommends the Realtime API, which enables low-latency speech-to-speech experiences.