Alibaba has introduced Qwen VLo, a multimodal AI model that analyzes, creates, and edits images based on text prompts. Qwen VLo generates images progressively, from left to right and top to bottom, which gives better control over the result and is especially useful for long text descriptions.
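The left-to-right, top-to-bottom order can be pictured as a raster scan over image patches. The sketch below is only an illustration of that ordering, not Qwen VLo's actual implementation: the names `raster_order`, `generate_progressively`, and the `fill` callback are hypothetical, and the assumption that the image is filled patch by patch, row by row, is ours.

```python
from typing import Any, Callable, Iterator, List, Tuple


def raster_order(rows: int, cols: int) -> Iterator[Tuple[int, int]]:
    """Yield (row, col) patch coordinates: rows top to bottom,
    and left to right within each row."""
    for r in range(rows):
        for c in range(cols):
            yield (r, c)


def generate_progressively(
    rows: int,
    cols: int,
    fill: Callable[[int, int, int, List[List[Any]]], Any],
) -> List[List[Any]]:
    """Fill a grid of patches one at a time in raster order.

    `fill` is a hypothetical stand-in for the model: it receives the step
    index, the patch position, and the canvas produced so far, so each new
    patch can depend on everything already generated.
    """
    canvas: List[List[Any]] = [[None] * cols for _ in range(rows)]
    for step, (r, c) in enumerate(raster_order(rows, cols)):
        canvas[r][c] = fill(step, r, c, canvas)
    return canvas


if __name__ == "__main__":
    # Toy "model": label each patch with the step at which it was produced.
    grid = generate_progressively(2, 3, lambda step, r, c, canvas: step)
    for row in grid:
        print(row)
```

Because each patch is produced only after the ones above it and to its left, a partially finished canvas can be inspected at any step, which is one way to read the claim about better control over the result.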
The model understands complex natural language instructions. Users can change the background, add new objects, change the image style, and combine multiple images into one.
Qwen VLo supports both artistic and technical edits. It creates segmentation maps, performs contour detection, and produces depth maps with color overlays. The model also recognizes parts of the image and assesses the scene’s depth.
The system is designed to work with various image resolutions and aspect ratios, including extreme formats such as 4:1 or 1:3, although this capability has not yet been enabled. It processes requests in both Chinese and English.
Qwen VLo is currently available to try in Qwen Chat. The company acknowledges that the model still makes generation errors, produces results that do not match the source, and struggles with detailed instructions, and says it plans to improve stability and reliability.