🤖 AI Summary
This work addresses the difficulty of attributing performance gains in vision-language-action (VLA) models, a task often obscured by divergent training strategies and implementation details. To establish a transparent and reproducible baseline, the authors propose a minimalist yet high-performing VLA architecture that decouples perception and control. The design pairs a standard vision-language backbone with a lightweight action prediction head and a unified training pipeline, eliminating the need for robot-specific pretraining. Despite having only 0.5 billion parameters, the model surpasses several multi-billion-parameter counterparts on simulation benchmarks and achieves real-world robotic performance comparable to that of pi0.5, demonstrating that architectural clarity and consistent training protocols can yield significant gains in both efficiency and effectiveness.
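The decoupled design described above, a standard vision-language backbone feeding a lightweight action prediction head, can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the backbone is stubbed with a random projection, and all names, dimensions, and the action-chunk shape are assumptions for the sake of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def vlm_backbone(image, instruction, d_model=512):
    """Stand-in for a pretrained vision-language backbone.

    In a real VLA model this would be a standard VLM encoding the
    camera image and language instruction; here it is stubbed with a
    fixed random projection to a pooled embedding, for illustration.
    """
    feats = np.concatenate([image.ravel(), instruction])
    W = rng.standard_normal((d_model, feats.size)) * 0.01
    return W @ feats  # (d_model,) pooled embedding

class ActionHead:
    """Lightweight MLP head mapping an embedding to a chunk of actions."""
    def __init__(self, d_model=512, horizon=8, action_dim=7, hidden=256):
        self.W1 = rng.standard_normal((hidden, d_model)) * 0.01
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((horizon * action_dim, hidden)) * 0.01
        self.b2 = np.zeros(horizon * action_dim)
        self.horizon, self.action_dim = horizon, action_dim

    def __call__(self, z):
        h = np.maximum(0.0, self.W1 @ z + self.b1)  # ReLU
        a = self.W2 @ h + self.b2
        return a.reshape(self.horizon, self.action_dim)

# Perception and control stay decoupled: the backbone only produces an
# embedding, and only the small head is responsible for actions.
image = rng.standard_normal((3, 32, 32))   # toy camera frame
instruction = rng.standard_normal(16)      # toy text embedding
z = vlm_backbone(image, instruction)
actions = ActionHead()(z)
print(actions.shape)  # horizon x action dimensions
```

Under this split, architectural changes on either side can be evaluated in isolation, which is the kind of clean attribution the baseline is meant to enable.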
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic manipulation, leveraging large-scale pre-training to achieve strong performance. The field has rapidly evolved with additional spatial priors and diverse architectural innovations. However, these advancements are often accompanied by varying training recipes and implementation details, which makes it challenging to disentangle the precise source of empirical gains. In this work, we introduce SimVLA, a streamlined baseline designed to establish a transparent reference point for VLA research. By strictly decoupling perception from control, using a standard vision-language backbone and a lightweight action head, and standardizing critical training dynamics, we demonstrate that a minimal design can achieve state-of-the-art performance. Despite having only 0.5B parameters, SimVLA outperforms multi-billion-parameter models on standard simulation benchmarks without robot pretraining. SimVLA also achieves real-robot performance on par with pi0.5. Our results establish SimVLA as a robust, reproducible baseline that enables clear attribution of empirical gains to future architectural innovations. Website: https://frontierrobo.github.io/SimVLA