Alibaba's Qwen AI team introduced Qwen3-Omni, a new AI model that works with text, images, audio, and video in real time. Qwen3-Omni processes text in 119 languages, recognizes speech in 19 languages, and responds with speech in 10. The model can transcribe up to 30 minutes of audio, and its response latency is as low as 234 milliseconds. The architecture is split into two parts: “Thinker” analyzes the input and generates text, while “Talker” converts that text into speech as it is produced, so voice output starts quickly.
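The announcement does not include reference code, but the latency benefit of the split is easy to picture: the text stage streams partial output and the speech stage consumes it chunk by chunk, so playback can begin before the full answer is finished. The sketch below is a conceptual illustration only, with hypothetical thinker and talker functions standing in for the real model components.

```python
# Conceptual sketch of a Thinker/Talker split (illustrative only, not the
# actual Qwen3-Omni implementation): the "thinker" streams text chunks and
# the "talker" turns each chunk into audio as soon as it arrives.
from typing import Iterator


def thinker(prompt: str) -> Iterator[str]:
    """Hypothetical text generator that yields the reply chunk by chunk."""
    for chunk in ["Sure, ", "here is ", "the answer."]:
        yield chunk


def talker(text_chunk: str) -> bytes:
    """Hypothetical TTS stage that converts a text chunk into audio bytes."""
    return text_chunk.encode("utf-8")  # placeholder for real synthesized audio


def respond(prompt: str) -> Iterator[bytes]:
    """Pipe thinker output straight into the talker to minimize latency."""
    for chunk in thinker(prompt):
        yield talker(chunk)


for audio in respond("What is Qwen3-Omni?"):
    pass  # stream each audio chunk to the speaker here
```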
Qwen3-Omni posted leading results on 32 of 36 audio and audio-video benchmarks, outperforming Gemini 2.5 Flash and GPT-4o in speech recognition and voice generation. The model uses a mixture-of-experts architecture that activates only about three billion parameters per request, which keeps processing fast and performance stable even when it handles several data types at once.
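For readers unfamiliar with mixture-of-experts layers, the toy routing sketch below shows the general idea behind “three billion active parameters”: a gate scores every expert, but only the top-scoring few actually run per token. The sizes and router here are generic illustrations, not Qwen3-Omni's actual configuration.

```python
# Minimal sketch of mixture-of-experts routing (generic technique, not the
# exact Qwen3-Omni router): a gate scores all experts, but only the top-k
# experts are evaluated, so only a small slice of the parameters is active.
import numpy as np

num_experts, top_k, hidden = 8, 2, 16
rng = np.random.default_rng(0)
experts = [rng.standard_normal((hidden, hidden)) for _ in range(num_experts)]
gate = rng.standard_normal((hidden, num_experts))


def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ gate                   # gate logits for every expert
    top = np.argsort(scores)[-top_k:]   # pick the k best-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()            # softmax over the selected experts only
    # Only the chosen experts are evaluated; the rest stay idle.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))


out = moe_forward(rng.standard_normal(hidden))
print(out.shape)  # (16,)
```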
Users can customize the behavior of Qwen3-Omni through system prompts, for example to change the style or “personality” of its responses. The model integrates with external tools and services to carry out complex tasks. It is available in Qwen Chat, as a demo on Hugging Face, and developers can connect it to their applications via Alibaba's API.
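As a hedged example of how a developer might call the model with a custom “personality”, the snippet below uses the OpenAI-compatible endpoint exposed by Alibaba Cloud Model Studio; the base URL and model name are assumptions and should be checked against the current documentation.

```python
# Hedged sketch of calling Qwen3-Omni through Alibaba Cloud's OpenAI-compatible
# endpoint; the base_url and model name are assumptions based on Model Studio
# conventions and may differ in the current docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",  # placeholder credential
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# A system prompt customizes the style or "personality" of the responses.
stream = client.chat.completions.create(
    model="qwen3-omni-flash",  # assumed model name; verify in the console
    messages=[
        {"role": "system", "content": "You are a cheerful radio host. Keep answers short."},
        {"role": "user", "content": "Introduce yourself in one sentence."},
    ],
    stream=True,  # omni models are typically served in streaming mode
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```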
In addition to the base version, Alibaba released a specialized model, Qwen3-Omni-30B-A3B-Captioner, for detailed audio descriptions such as music or sound effects. Qwen3-Omni-30B-A3B-Instruct for instruction following and Qwen3-Omni-30B-A3B-Thinking for complex reasoning tasks have also been made available.
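Since these checkpoints are published as open weights on Hugging Face, they can be fetched with standard huggingface_hub tooling. The repository IDs below mirror the model names above; treat the exact paths as assumptions to verify on the hub, and note that each checkpoint is tens of gigabytes.

```python
# Hedged sketch: downloading one of the open-weight checkpoints from Hugging
# Face. The repository IDs follow the model names in the announcement; the
# download call itself is standard huggingface_hub usage.
from huggingface_hub import snapshot_download

repos = [
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",   # instruction following
    "Qwen/Qwen3-Omni-30B-A3B-Thinking",   # complex reasoning
    "Qwen/Qwen3-Omni-30B-A3B-Captioner",  # detailed audio captions
]

# Download the captioner weights locally (large; needs matching disk space).
local_dir = snapshot_download(repo_id=repos[-1])
print("Checkpoint stored at:", local_dir)
```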