The French company Mistral has introduced Voxtral, an open language model for speech recognition and understanding. It comes in two versions: Voxtral Small 24B for production applications and the compact Voxtral Mini 3B for local or edge use. Both support a context window of 32,000 tokens, enough to process audio of up to 30 minutes for transcription and up to 40 minutes for understanding tasks.
The model recognizes English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian. It combines transcription, question answering, and summarization in a single model, with no need for separate language or recognition modules. Users can also trigger backend functions by voice, because the model automatically converts spoken requests into API calls.
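To illustrate the voice-to-function idea, here is a minimal Python sketch of the backend side only. It assumes the model's output has already been turned into a JSON object with a `name` and `arguments` field. That shape, the `dispatch` helper, and the two backend functions are hypothetical and chosen for illustration; they are not taken from Mistral's API.

```python
import json

# Hypothetical backend functions; the names are illustrative only.
def set_thermostat(room: str, celsius: float) -> str:
    return f"{room} set to {celsius:.1f} C"

def play_music(artist: str) -> str:
    return f"playing {artist}"

# Registry mapping function names to the callables the backend exposes.
BACKEND = {"set_thermostat": set_thermostat, "play_music": play_music}

def dispatch(tool_call_json: str) -> str:
    """Run the backend function named in a structured call.

    The {"name": ..., "arguments": {...}} layout is an assumption
    made for this sketch, not a documented Voxtral output format.
    """
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in BACKEND:
        raise ValueError(f"unknown function: {name}")
    return BACKEND[name](**call.get("arguments", {}))

# Assumed structured call for the spoken request
# "Set the living room to 21 degrees".
example = (
    '{"name": "set_thermostat", '
    '"arguments": {"room": "living room", "celsius": 21}}'
)
print(dispatch(example))
```

Keeping an explicit registry means the model can only reach functions the backend has chosen to expose, and an unrecognized name fails loudly instead of being executed.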
According to Mistral’s tests, Voxtral Small outperforms Whisper large-v3, GPT-4o mini Transcribe, and Gemini 2.5 Flash on most tasks, particularly short-form English and the multilingual FLEURS benchmark. The model also showed competitive results in audio understanding and speech translation. Mistral adds that Voxtral Mini Transcribe is both more accurate and cheaper than OpenAI Whisper.
Pricing for the Voxtral API starts at $0.001 per minute. For corporate clients, private deployment and fine-tuning for industry-specific needs are available. Upcoming updates will include speaker segmentation, emotion and age markup, and word-level timestamps.
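As a rough illustration of what the quoted starting price means in practice, the Python sketch below estimates the cost of a transcription job. It assumes strictly linear per-minute billing at $0.001 with no rounding or minimum charge, which the article does not confirm. The `estimate_cost` helper is hypothetical.

```python
PRICE_PER_MINUTE_USD = 0.001  # starting API price quoted in the article

def estimate_cost(duration_seconds: float,
                  price_per_minute: float = PRICE_PER_MINUTE_USD) -> float:
    """Estimate the API cost in USD for audio of the given length.

    Assumes linear per-minute billing with no rounding up and no
    minimum charge; actual billing rules may differ.
    """
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    return duration_seconds / 60 * price_per_minute

# A 30-minute recording, the transcription limit mentioned in the article.
print(f"30 min: ${estimate_cost(30 * 60):.3f}")

# 1,000 hours of archived audio.
print(f"1000 h: ${estimate_cost(1000 * 3600):.2f}")
```

Under these assumptions a 30-minute file costs about three cents, and 1,000 hours of audio about $60.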
The models are already available for download on Hugging Face under the Apache 2.0 license and through the API. In the coming weeks, Voxtral will become the basis of the voice mode in Le Chat, allowing users to dictate messages and interact with the platform by voice on both the web and mobile devices.