The Hugging Face team has introduced two new AI models, SmolVLM-256M and SmolVLM-500M. The models can analyze images, short videos, and text, and they are designed to run on resource-constrained devices such as laptops with less than one gigabyte of RAM.
SmolVLM-256M and SmolVLM-500M have 256 million and 500 million parameters, respectively. They can perform tasks such as describing images or videos and answering questions about PDF documents, including scanned text and diagrams. The models were trained using The Cauldron and Docmatix datasets, created by the M4 team at Hugging Face.
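To put the parameter counts in context, the memory needed just to hold a model's weights can be approximated as the number of parameters multiplied by the bytes used per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement: it uses the nominal counts quoted above, the precision formats listed are assumptions chosen for illustration, and it ignores activations, caches, and runtime overhead.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: weights only; activations, caches, and runtime overhead are ignored.

# Illustrative precisions (assumed, not taken from the article).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

# Nominal parameter counts as stated in the article.
PARAM_COUNTS = {
    "SmolVLM-256M": 256_000_000,
    "SmolVLM-500M": 500_000_000,
}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def estimate_all() -> dict:
    """Estimate weight memory for every model at every listed precision."""
    return {
        name: {dtype: weight_memory_gb(count, dtype) for dtype in BYTES_PER_PARAM}
        for name, count in PARAM_COUNTS.items()
    }


if __name__ == "__main__":
    for name, sizes in estimate_all().items():
        for dtype, gb in sizes.items():
            print(f"{name} @ {dtype}: ~{gb:.2f} GB of weights")
```

Under these assumptions, the 256M model's weights come to roughly half a gigabyte at 16-bit precision, while the 500M model's weights come to about one gigabyte. Actual memory use during inference will be higher and depends on the runtime, image resolution, and sequence length.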
Notably, the new models outperform the much larger Idefics 80B model on tests that involve analyzing school-level science diagrams. SmolVLM-256M and SmolVLM-500M are available on the Hugging Face website under the permissive Apache 2.0 license, which allows them to be downloaded and used freely.
Although smaller models like SmolVLM-256M and SmolVLM-500M can be cost-effective and versatile, they may also have shortcomings that are less pronounced in larger models. Research has shown that many small models perform worse on complex logical reasoning tasks. This may be because smaller models recognize only superficial patterns in the data.