NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

📅 2025-04-28

📈 Citations: 0

✨ Influential: 0

career value

210K/year

🤖 AI Summary

Existing vision-language-action (VLA) models exhibit strong zero-shot generalization for embodied tasks but suffer from weak visual encoding—leading to frequent grasping failures—and excessive parameter counts (>7B), resulting in high computational overhead that hinders real-time robotic deployment. To address these limitations, we propose a lightweight VLA framework built upon the 3B-parameter Qwen-2.5-VL-3B vision-language backbone—the first such adoption in VLA. Our method integrates 970K real-world robot demonstrations, vision-semantic alignment, action sequence distillation, and FAST+—an efficient action tokenization technique. Compared to 7B-scale VLA models, our approach achieves superior zero-shot task performance while accelerating inference by 2.1× and reducing GPU memory consumption by 58%. Empirical evaluation on edge hardware demonstrates real-time control at 25 Hz, striking a compelling balance among accuracy, efficiency, and deployability.

Technology Category

Application Category

📝 Abstract

Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot scenarios, demonstrating impressive task execution and reasoning capabilities. However, a significant challenge arises from the limitations of visual encoding, which can result in failures during tasks such as object grasping. Moreover, these models typically suffer from high computational overhead due to their large sizes, often exceeding 7B parameters. While these models excel in reasoning and task planning, the substantial computational overhead they incur makes them impractical for real-time robotic environments, where speed and efficiency are paramount. To address the limitations of existing VLA models, we propose NORA, a 3B-parameter model designed to reduce computational overhead while maintaining strong task performance. NORA adopts the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance visual reasoning and action grounding. Additionally, our model{} is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation. Experimental results demonstrate that NORA outperforms existing large-scale VLA models, achieving better task performance with significantly reduced computational overhead, making it a more practical solution for real-time robotic autonomy.

Problem

Research questions and friction points this paper is trying to address.

Reducing computational overhead in Visual-Language-Action models

Improving visual encoding for better task execution

Enhancing real-time robotic autonomy with efficient models

Innovation

Methods, ideas, or system contributions that make the work stand out.

3B-parameter model reduces computational overhead

Qwen-2.5-VL-3B backbone enhances visual-semantic understanding

FAST+ tokenizer enables efficient action sequence generation

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs