SmolVLM: Redefining small and efficient multimodal models

📅 2025-04-07
📈 Citations: 2
Influential: 0
🤖 AI Summary
Large vision-language models (VLMs) consume too much GPU memory and run too slowly for mobile and edge devices. This work presents SmolVLM, a family of compact VLMs built by jointly optimizing architecture, tokenization, and training data for resource-constrained inference. The authors systematically explore architectural configurations, tokenization strategies for images and video, and data curation to identify design choices that improve accuracy at a small memory footprint. Key results: (1) SmolVLM-256M uses less than 1 GB of GPU memory during inference while outperforming the roughly 300-times larger Idefics-80B; (2) the 2.2B-parameter variant rivals state-of-the-art VLMs that consume twice the GPU memory; and (3) the models extend beyond static images to video comprehension.

📝 Abstract
Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.
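The size and memory figures in the abstract can be sanity-checked with simple arithmetic. The sketch below is a rough estimate only: it assumes 2 bytes per parameter (fp16 or bf16, a precision the abstract does not state), counts weights alone (no activations or KV cache), and takes the nominal parameter counts from the model names. The function names are hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes).

    Assumes every parameter is stored at the same precision; 2 bytes
    corresponds to fp16/bf16, which is an assumption, not a reported setting.
    """
    return num_params * bytes_per_param / 1e9


def size_ratio(large_params: float, small_params: float) -> float:
    """How many times larger the first model is, by parameter count."""
    return large_params / small_params


# Nominal parameter counts taken from the model names.
SMOLVLM_256M = 256e6
IDEFICS_80B = 80e9

if __name__ == "__main__":
    print(f"SmolVLM-256M weights: ~{weight_memory_gb(SMOLVLM_256M):.3f} GB")
    print(f"Idefics-80B / SmolVLM-256M: ~{size_ratio(IDEFICS_80B, SMOLVLM_256M):.1f}x")
```

Under these assumptions the weights of the 256M model take about half a gigabyte, which leaves headroom for activations within the reported 1 GB budget, and the parameter ratio comes out near the "300-times larger" figure quoted above.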
Problem

Research questions and friction points this paper is trying to address.

Develop compact multimodal models for resource-efficient inference
Optimize architectural configurations for low computational overhead
Enhance performance with minimal GPU memory usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

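SmolVLM, a series of compact multimodal models (256M to 2.2B parameters) engineered for resource-efficient inference
Aggressive yet efficient tokenization and carefully curated training data that keep computational overhead low
A systematic exploration of architectural configurations that identifies design choices yielding performance gains at minimal memory footprint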