Alibaba has presented a new series of AI models called Qwen2.5-VL. The models handle a range of tasks that combine text and image analysis, including object recognition in images, document analysis, and video understanding. They can also control PCs, similar to OpenAI's Operator agent. According to the reported test results, Qwen2.5-VL outperforms OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 2.0 Flash.
Qwen2.5-VL is available for testing in the Qwen Chat app and on the Hugging Face platform. It can analyze graphs and charts, extract data from scanned invoices and forms, and understand videos several hours long. The model can also recognize intellectual property (IP) from movies and TV series, as well as various products, which suggests it may have been trained on copyrighted material.
A notable feature of Qwen2.5-VL is its ability to interact with software on PCs and mobile devices. For example, it can launch applications and carry out tasks such as booking flights through mobile apps. This opens up new possibilities for automating and simplifying the use of various services.
The Qwen2.5-VL series includes several models. The two smaller ones, Qwen2.5-VL-3B and Qwen2.5-VL-7B, are available under a permissive license. The most powerful model, Qwen2.5-VL-72B, is released under a special license from Alibaba.