NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language-action (VLA) models show limited reliability and generalization on embodied tasks, with marked performance degradation when transferred across robot embodiments or into real-world environments. To address this, we introduce NORA-1.5, which extends the pre-trained NORA backbone with a flow-matching action expert, an architectural change that by itself yields substantial gains. On top of this, we propose a reward-driven post-training framework: (1) an action-conditioned world model that judges whether generated actions lead toward the desired goal; (2) a heuristic reward based on deviation from ground-truth actions that separates good actions from poor ones; and (3) preference datasets built from these reward signals, used to adapt the policy to target embodiments via direct preference optimization (DPO) without an explicit reinforcement-learning loop. The resulting model outperforms NORA and several state-of-the-art VLA models on both simulated and real-robot benchmarks, and reward-driven post-training consistently improves reliability in both settings. These results suggest that simple reward models combined with preference-based optimization are an effective route to more dependable embodied agents for real-world deployment.
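
Based on the summary above, a minimal sketch of how the two reward signals could be turned into DPO preference pairs is shown below. The interfaces `policy.sample_actions` and `world_model.predict`, the L2 deviation metric, the goal-similarity score, and the mixing weight `alpha` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def deviation_reward(pred_actions: np.ndarray, gt_actions: np.ndarray) -> float:
    """Heuristic reward: negative L2 deviation of a sampled action chunk from
    the ground-truth actions, so smaller deviation means higher reward.
    The L2 metric is an assumption; the paper only states that the reward is
    grounded in deviation from ground truth."""
    return -float(np.linalg.norm(pred_actions - gt_actions))


def goal_similarity(predicted_obs: np.ndarray, goal_obs: np.ndarray) -> float:
    """Toy goal score: negative L2 distance between predicted and goal
    observation embeddings (a stand-in for the paper's WM-based check)."""
    return -float(np.linalg.norm(predicted_obs - goal_obs))


def world_model_reward(world_model, obs, pred_actions, goal_obs) -> float:
    """Roll the action-conditioned world model forward and score how close the
    predicted outcome is to the goal. `world_model.predict` is a hypothetical
    interface; the actual WM reward may be defined differently."""
    predicted_obs = world_model.predict(obs, pred_actions)
    return goal_similarity(predicted_obs, goal_obs)


def build_preference_pairs(policy, world_model, episodes, n_samples=8, alpha=0.5):
    """Sample several candidate action chunks per state, score them with the
    combined reward, and keep the best/worst pair as (chosen, rejected) data
    for DPO. The mixing weight `alpha` and the best-vs-worst pairing rule are
    assumptions."""
    pairs = []
    for obs, gt_actions, goal_obs in episodes:
        candidates = [policy.sample_actions(obs) for _ in range(n_samples)]
        scores = [
            alpha * deviation_reward(a, gt_actions)
            + (1.0 - alpha) * world_model_reward(world_model, obs, a, goal_obs)
            for a in candidates
        ]
        chosen = candidates[int(np.argmax(scores))]
        rejected = candidates[int(np.argmin(scores))]
        pairs.append({"obs": obs, "chosen": chosen, "rejected": rejected})
    return pairs
```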

📝 Abstract
Vision-language-action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by adding to it a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating significant VLA model-reliability gains through simple yet effective reward models. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.
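
For concreteness, the standard DPO objective over (chosen, rejected) action chunks might look like the sketch below. The abstract does not say how action log-probabilities are obtained from a flow-matching action expert, so the snippet simply assumes summed per-chunk log-probabilities are available; `beta=0.1` is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: push the trained policy to prefer the chosen action
    chunk over the rejected one, relative to a frozen reference policy.
    Inputs are per-sample summed log-probabilities of each chunk."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

In training, the `policy_*` terms would come from the model being adapted and the `ref_*` terms from a frozen copy taken before post-training.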
Problem

Research questions and friction points this paper is trying to address.

Improving reliability and generalization of vision-language-action models
Enhancing embodied task performance across different embodiments and environments
Addressing action quality and goal achievement in VLA models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-matching-based action expert enhances VLA architecture (see the inference sketch after this list)
Action-conditioned world model and deviation heuristic form rewards
Direct preference optimization adapts the model to target embodiments using reward-derived preference data
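
To illustrate the first item, here is a minimal sketch of flow-matching inference for an action expert, assuming a learned velocity network conditioned on VLM features. The `velocity_net` interface, action dimension, chunk horizon, and Euler step count are all illustrative assumptions, not details from the paper.

```python
import torch


@torch.no_grad()
def sample_action_chunk(velocity_net, vlm_features, action_dim=7, horizon=8, steps=10):
    """Flow-matching inference sketch: integrate a learned velocity field from
    Gaussian noise (t = 0) to an action chunk (t = 1) with fixed-step Euler.
    `velocity_net(x_t, t, cond)` is a hypothetical interface for the action
    expert conditioned on the VLM backbone's features."""
    x = torch.randn(1, horizon, action_dim)   # noise sample at flow time t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)          # current flow time in [0, 1)
        v = velocity_net(x, t, vlm_features)  # predicted velocity toward data
        x = x + dt * v                        # Euler integration step
    return x                                  # final denoised action chunk
```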