🤖 AI Summary
Current dual-system vision-language-action (VLA) architectures for embodied intelligence lack open-source benchmarks and reproducible analytical frameworks. Method: We conduct the first structured comparative study and empirical attribution analysis, proposing a lightweight, modular, and extensible open-source VLA paradigm. Our approach integrates a ViT-based visual encoder, a parameter-efficient LLM language decoder, an action head, and a cross-modal alignment mechanism, trained via joint instruction tuning and imitation learning. Contribution/Results: Evaluated on RT-2 and Open-X Embodiment benchmarks, our model uses fewer than 1B parameters and runs at under 120 ms inference latency while maintaining competitive performance. We fully open-source the codebase, pre-trained weights, and an integrated evaluation toolkit, enabling community-driven development, reproducible experimentation, and continuous advancement of VLA research.
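The four components named above (vision encoder, language decoder, cross-modal alignment, action head) can be sketched as a minimal skeleton. This is an illustrative sketch only, not the authors' implementation: all module sizes, names, and the `TinyVLA` class itself are hypothetical placeholders.

```python
# Hypothetical minimal dual-system VLA skeleton (illustration only).
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    def __init__(self, patch_dim=64, vocab=1000, d_model=128, action_dim=7):
        super().__init__()
        # Stand-in for a ViT visual encoder: patch embedding + one transformer layer.
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.vis_encoder = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Stand-in for a parameter-efficient LLM language decoder.
        self.tok_embed = nn.Embedding(vocab, d_model)
        self.lm_decoder = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        # Cross-modal alignment: project vision tokens into the language space.
        self.align = nn.Linear(d_model, d_model)
        # Action head: regress a continuous action vector (e.g. end-effector deltas).
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, patches, tokens):
        v = self.align(self.vis_encoder(self.patch_embed(patches)))  # (B, P, d)
        t = self.tok_embed(tokens)                                   # (B, T, d)
        h = self.lm_decoder(t, v)           # language tokens attend to vision tokens
        return self.action_head(h[:, -1])   # predict action from the last token

model = TinyVLA()
patches = torch.randn(2, 16, 64)           # batch of 2 images, 16 patches each
tokens = torch.randint(0, 1000, (2, 8))    # instruction token ids
actions = model(patches, tokens)           # shape (2, 7)
```

In a real dual-system setup the vision-language stack would be a pre-trained VLM and the action head would be trained via imitation learning on demonstration trajectories; this sketch only shows how the modules connect.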
📝 Abstract
Dual-system VLA (Vision-Language-Action) architectures have become a hot topic in embodied intelligence research, but there is a lack of open-source work sufficient for further performance analysis and optimization. To address this problem, this paper summarizes and compares the structural designs of existing dual-system architectures and conducts systematic empirical evaluations of their core design elements, ultimately providing a low-cost open-source model for further exploration. This project will be continuously updated with additional experimental conclusions and open-source models with improved performance. Project page: https://openhelix-robot.github.io/.