🤖 AI Summary
Current vision-language-action (VLA) models generalize poorly and are unreliable in deployment on embodied tasks, with performance degrading markedly when they are transferred across robot embodiments or into real-world environments. To address this, we propose NORA-1.5, which adds a flow-matching action expert to the pre-trained NORA backbone, together with a reward-driven post-training framework: (1) an action-conditioned world model that scores whether generated actions lead toward the desired goal; (2) a heuristic reward based on deviation from ground-truth actions; and (3) preference datasets built from these rewards and used to adapt the policy to target embodiments via direct preference optimization (DPO). NORA-1.5 outperforms NORA and several state-of-the-art VLA models on both simulated and real-robot benchmarks, and reward-driven post-training consistently improves performance further. These results indicate that simple reward models combined with preference-based optimization can meaningfully improve VLA reliability, offering a viable path toward practical deployment of embodied agents.
📝 Abstract
Vision-language-action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or in real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by adding a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating that simple yet effective reward models can significantly improve VLA reliability. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.
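
The pipeline described in the abstract (score candidate actions, build preference pairs, optimize with DPO) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the Euclidean form of the deviation heuristic, the linear blend controlled by `weight`, and the `goal_score` callable standing in for the world model are all assumptions made for illustration. The loss is the standard single-pair DPO objective; it assumes per-action log-probabilities are available, and how those are obtained for a flow-matching action expert is not stated in the abstract.

```python
import math


def deviation_reward(action, reference_action):
    """Assumed heuristic: negative Euclidean distance to the ground-truth action."""
    return -math.dist(action, reference_action)


def combined_reward(action, reference_action, goal_score, weight=0.5):
    """Blend a hypothetical world-model goal score with the deviation heuristic.

    `goal_score` is a placeholder callable; the linear blend is an assumption.
    """
    return (weight * goal_score(action)
            + (1.0 - weight) * deviation_reward(action, reference_action))


def build_preference_pair(candidates, reference_action, goal_score, weight=0.5):
    """Return (chosen, rejected): the highest- and lowest-reward candidates."""
    ranked = sorted(
        candidates,
        key=lambda a: combined_reward(a, reference_action, goal_score, weight),
    )
    return ranked[-1], ranked[0]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin is the policy's log-ratio advantage for the chosen action
    minus the same quantity for the rejected action, both measured
    against a frozen reference policy.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Toy usage: three sampled 2-D actions, a constant placeholder goal score.
reference = (0.0, 0.0)
candidates = [(0.1, 0.0), (1.0, 1.0), (0.5, 0.5)]
chosen, rejected = build_preference_pair(candidates, reference, lambda a: 0.0)
```

In this toy setup the candidate closest to the reference action becomes the preferred sample and the farthest becomes the rejected one; a real world-model score would shift that ranking toward actions that make progress on the task even when they differ from the demonstration.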