SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language-action (VLA) models suffer from excessive parameter counts, prohibitive training costs, and reliance on centralized, curated datasets, overlooking the abundant, low-cost robot data collected by the broader robotics community. This work introduces a lightweight VLA model that addresses these limitations. Methodologically, it is trained exclusively on real-world, community-sourced robot data; employs a compact Transformer architecture with multimodal instruction tuning; proposes an asynchronous inference stack that decouples perception and action prediction from execution, enabling chunked action generation and high-frequency control; and incorporates action caching alongside CPU/GPU cross-platform deployment optimizations. The resulting model uses only about 10% of the parameters of state-of-the-art (SOTA) VLAs and supports training on a single GPU and inference on CPU, yet achieves competitive generalization performance across both simulation and real-robot tasks.
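The asynchronous decoupling described above can be sketched as follows. This is a minimal illustration, not SmolVLA's actual implementation: `predict_chunk`, the chunk size, and the refill threshold are all assumed names and values. A background thread refills a queue with chunked actions while the control loop consumes them at a fixed rate, so slow inference does not block execution.

```python
import queue
import threading
import time

CHUNK_SIZE = 5        # actions produced per inference call (illustrative)
REFILL_THRESHOLD = 2  # request a new chunk when this few actions remain

def predict_chunk(step):
    # Stand-in for the (slow) VLA forward pass; returns a chunk of actions.
    time.sleep(0.05)  # simulate inference latency
    return [f"action_{step}_{i}" for i in range(CHUNK_SIZE)]

def async_control_loop(n_steps=10, control_period=0.02):
    """Decouple inference from execution: a worker thread refills an
    action queue while the main loop executes actions at a fixed rate."""
    actions = queue.Queue()

    def refill(step):
        for a in predict_chunk(step):
            actions.put(a)

    refill(0)  # prime the queue with an initial chunk (blocking once)

    executed = []
    inflight = None
    step = 0
    while len(executed) < n_steps:
        # Kick off the next inference early, before the queue runs dry.
        if actions.qsize() <= REFILL_THRESHOLD and (
            inflight is None or not inflight.is_alive()
        ):
            inflight = threading.Thread(target=refill, args=(step,))
            inflight.start()
        executed.append(actions.get())  # execute next action
        step += 1
        time.sleep(control_period)
    return executed
```

Because a new chunk is requested while actions from the previous chunk are still being executed, the control loop only blocks if inference latency exceeds the time to drain the remaining queued actions.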

📝 Abstract
Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive, often with billions of parameters, leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack decoupling perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated and real-world robotic benchmarks and release all code, pretrained models, and training data.
Problem

Research questions and friction points this paper is trying to address.

Reducing training and inference costs for vision-language-action models
Enabling deployment on affordable hardware like consumer GPUs or CPUs
Leveraging community-collected data for efficient robotic control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Small, efficient, community-driven VLA model
Single GPU training and consumer deployment
Asynchronous inference for higher control rates