OpenAI has introduced new generative AI models for transcription and voice synthesis, available through its API. The new models, gpt-4o-mini-tts and gpt-4o-transcribe, promise to improve on previous versions by producing more natural-sounding speech and letting developers customize speaking styles. For example, a developer can instruct the model to speak “like a mad scientist” or with “a calm voice, like a meditation teacher.”
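As an illustration, the sketch below shows how such a style instruction might be passed through the OpenAI Python SDK's speech endpoint. The `instructions` parameter, the `coral` voice, and the output file name are assumptions based on the announcement, not details confirmed in this article.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Synthesize speech with a style instruction. The "instructions" field
# and the "coral" voice are assumptions for illustration; check the
# current API reference for the supported parameters and voices.
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="I'm sorry for the wait. Thank you for your patience.",
    instructions="Speak in a calm voice, like a meditation teacher.",
)

# Save the returned audio bytes to a local file.
with open("apology.mp3", "wb") as f:
    f.write(response.content)
```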
The new text-to-speech model converts text into speech with greater accuracy and can reproduce emotional nuances in the voice. This is useful for applications such as customer support, where conveying apology or empathy through the voice matters. According to OpenAI representatives, this lets users and developers control not only what is said, but also how it sounds.
The gpt-4o-transcribe model replaces the earlier Whisper model for transcription. It is trained on a diverse set of high-quality audio data, which improves recognition of accents and varied speech, even in challenging conditions. This significantly reduces the likelihood of errors Whisper was prone to, such as hallucinated words or phrases appearing in transcripts.
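A minimal sketch of how a recording might be sent to the new model follows; it assumes gpt-4o-transcribe is accepted by the same transcriptions endpoint that previously served whisper-1, and the file name is a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local recording with the new model. "meeting.wav" is a
# hypothetical file; the endpoint and parameters mirror the existing
# transcriptions API rather than anything specified in the article.
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```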
Despite the improvements, OpenAI does not plan to release the new transcription models openly. Company representatives note that the new models are significantly larger than Whisper and are not well suited to running locally on ordinary devices. They emphasize the importance of a careful approach to open-sourcing, releasing models only where doing so meets a specific need.