ElevenLabs, a company known for its work in audio generation, has announced its first standalone speech-to-text model, called “Scribe.” The model supports ninety-nine languages, and for more than twenty-five of them its error rate is below five percent. These include English, French, German, Spanish, and many others.
“Introducing Scribe: the most accurate Speech to Text model. It has the highest accuracy on benchmarks, outperforming previous state-of-the-art models such as Gemini 2.0 and OpenAI Whisper v3. It’s now the leading model for English, Spanish, Italian, and many more. With support…”
ElevenLabs (@elevenlabsio), February 26, 2025
“Scribe” can handle real-world audio scenarios and provides features such as speaker diarization, word-level timestamps for precise subtitles, and automatic sound event labeling. The model is available to developers via the API and the ElevenLabs dashboard, where users can upload audio or video files.
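To illustrate how word-level timestamps and speaker labels translate into subtitles, here is a minimal Python sketch that groups timed words into SRT cues. The `Word` class, its field names, and the grouping thresholds are assumptions made for this example. They do not reproduce the actual Scribe response schema, so the API output would need to be mapped onto this structure first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # Hypothetical structure for one recognized word; not the real API schema.
    text: str
    start: float  # seconds from the beginning of the audio
    end: float
    speaker: str = ""  # diarization label, empty if unknown


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7, max_gap: float = 1.0) -> str:
    """Group words into cues, starting a new cue when the cue is full,
    after a long pause, or when the speaker changes."""
    cues: List[List[Word]] = []
    current: List[Word] = []
    for w in words:
        if current and (
            len(current) >= max_words
            or w.start - current[-1].end > max_gap
            or w.speaker != current[-1].speaker
        ):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        if cue[0].speaker:
            text = f"[{cue[0].speaker}] {text}"
        blocks.append(
            f"{i}\n{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}\n{text}"
        )
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [
        Word("Hello", 0.0, 0.4, "speaker_0"),
        Word("there", 0.5, 0.9, "speaker_0"),
        Word("Hi", 2.5, 2.8, "speaker_1"),
    ]
    print(words_to_srt(sample))
```

Splitting cues on speaker changes and long pauses keeps each subtitle attributable to one speaker. The limits of seven words and one second are arbitrary defaults to adjust per project.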
At launch, “Scribe” works only with pre-recorded audio, but the company plans to release a low-latency version for real-time use soon. The model is already suited to creating subtitles for videos and to other content that requires accurate speech recognition, and a real-time version would broaden its use further.
The transcription service is priced at forty cents ($0.40) per hour of audio. Some competitors charge less, but “Scribe” offers high accuracy and additional features at that price.
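As a rough guide to what that rate means in practice, the following Python sketch estimates the cost of a transcription job. It assumes billing is strictly proportional to audio duration at $0.40 per hour, with no minimum charge, plan discount, or rounding to whole minutes; the article does not confirm those billing details, and the function name is hypothetical.

```python
from decimal import Decimal

# Rate quoted in the article: forty cents per hour of audio.
PRICE_PER_HOUR_USD = Decimal("0.40")


def transcription_cost(duration_seconds: int) -> Decimal:
    """Estimate cost in USD, assuming pro rata billing by duration (an assumption)."""
    hours = Decimal(duration_seconds) / Decimal(3600)
    return (hours * PRICE_PER_HOUR_USD).quantize(Decimal("0.01"))


if __name__ == "__main__":
    # A 90-minute recording under these assumptions.
    print(transcription_cost(90 * 60))
```

Using `Decimal` rather than floating point keeps currency amounts exact to the cent.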