AI Summary
This work addresses the limitations of existing vision-language-action (VLA) models that rely on discrete chain-of-thought (CoT) reasoning, which incurs high computational overhead and struggles to align with continuous perception and control. To overcome this, we propose LaRA-VLA, a novel framework that implicitly embeds multimodal CoT reasoning in a continuous latent space, unifying reasoning and action prediction without explicitly generating CoT tokens at inference time. A curriculum learning strategy enables a smooth transition from explicit to implicit reasoning, whose latent dynamics then directly drive action generation. Experiments demonstrate that LaRA-VLA significantly outperforms current methods on long-horizon tasks in both simulation and real-world robotic settings, reducing inference latency by up to 90% and substantially improving the efficiency of real-time embodied control.
Abstract
Vision-Language-Action (VLA) models benefit from chain-of-thought (CoT) reasoning, but existing approaches incur high inference overhead and rely on discrete reasoning representations that are poorly matched to continuous perception and control. We propose Latent Reasoning VLA (\textbf{LaRA-VLA}), a unified VLA framework that internalizes multimodal CoT reasoning into continuous latent representations for embodied action. LaRA-VLA performs unified reasoning and prediction in latent space, eliminating explicit CoT generation at inference time and enabling efficient, action-oriented control. To realize latent embodied reasoning, we introduce a curriculum-based training paradigm that progressively transitions from explicit textual and visual CoT supervision to latent reasoning, and finally adapts the latent reasoning dynamics to condition action generation. We construct two structured CoT datasets and evaluate LaRA-VLA on both simulation benchmarks and long-horizon real-robot manipulation tasks. Experimental results show that LaRA-VLA consistently outperforms state-of-the-art VLA methods while reducing inference latency by up to 90\% compared to explicit CoT-based approaches, demonstrating latent reasoning as an effective and efficient paradigm for real-time embodied control. Project Page: \href{https://loveju1y.github.io/Latent-Reasoning-VLA/}{LaRA-VLA Website}.
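To make the latency argument concrete, the sketch below contrasts explicit CoT, which requires one sequential decoder pass per generated reasoning token, with latent reasoning, which performs only a few continuous latent updates before the action head. This is an illustrative toy, not the authors' implementation: the dimensions, the `step` function standing in for a decoder layer, and the pass counts are all hypothetical assumptions.

```python
# Toy contrast between explicit CoT (many sequential token-generation
# passes) and latent CoT (a few continuous latent updates).
# All names, shapes, and the stand-in "decoder" below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
D, K_EXPLICIT, K_LATENT, A = 64, 32, 4, 7  # hidden dim, pass counts, action dim

W_step = rng.standard_normal((D, D)) / np.sqrt(D)  # stand-in decoder layer
W_act = rng.standard_normal((D, A)) / np.sqrt(D)   # action head

def step(h):
    """One stand-in 'reasoning' update (plays the role of a decoder pass)."""
    return np.tanh(h @ W_step)

def explicit_cot_policy(obs_emb):
    """Explicit CoT: one sequential pass per generated reasoning token."""
    h = obs_emb
    for _ in range(K_EXPLICIT):  # K_EXPLICIT sequential passes at inference
        h = step(h)
    return h @ W_act

def latent_cot_policy(obs_emb):
    """Latent CoT: few continuous latent updates, no token decoding."""
    h = obs_emb
    for _ in range(K_LATENT):    # far fewer sequential passes
        h = step(h)
    return h @ W_act

obs = rng.standard_normal(D)
a_explicit = explicit_cot_policy(obs)
a_latent = latent_cot_policy(obs)
assert a_explicit.shape == a_latent.shape == (A,)
```

In this toy setting the number of sequential passes drops from 32 to 4; shrinking (or removing) the autoregressive reasoning loop in this way is the mechanism behind the reported inference-latency reduction.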