LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models

📅 2026-04-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing vision-language-action (VLA) models that reason before acting are mostly trained with static imitation learning, which limits adaptability, while reinforcement learning (RL) approaches optimize only the action space and neglect the underlying physical reasoning. To address both limitations, this work proposes LaST-R1, a framework built around Latent-to-Action Policy Optimization (LAPO), an RL post-training algorithm that jointly optimizes chain-of-thought reasoning in a continuous latent space and action generation. The framework adapts its reasoning length to environment complexity and supports multimodal fusion. With only a one-shot supervised warm-up, the method reaches a 99.8% average success rate on the LIBERO benchmark, outperforming prior state-of-the-art approaches, and LAPO post-training improves real-world robotic task performance by up to 44% over the warm-up policy. It also generalizes across simulated and real-world environments.
📝 Abstract
Vision-Language-Action (VLA) models have increasingly incorporated reasoning mechanisms for complex robotic manipulation. However, existing approaches share a critical limitation: whether employing explicit linguistic reasoning that suffers from latency and discretization, or utilizing more expressive continuous latent reasoning, they are predominantly confined to static imitation learning that limits adaptability and generalization. While online reinforcement learning (RL) has been introduced to VLAs to enable trial-and-error exploration, current methods exclusively optimize the vanilla action space, bypassing the underlying physical reasoning process. In this paper, we present LaST-R1, a unified VLA framework that integrates latent Chain-of-Thought (CoT) reasoning over physical dynamics prior to action execution, along with a tailored RL post-training paradigm. Specifically, we propose Latent-to-Action Policy Optimization (LAPO), a novel RL algorithm that jointly optimizes the latent reasoning process and the action generation. By bridging reasoning and control, LAPO improves the representation of physical world modeling and enhances robustness in interactive environments. Furthermore, an adaptive latent CoT mechanism is introduced to allow the policy to dynamically adjust its reasoning horizon based on environment complexity. Extensive experiments show that LaST-R1 achieves a near-perfect 99.8% average success rate on the LIBERO benchmark with only one-shot supervised warm-up, significantly improving convergence speed and performance over prior state-of-the-art methods. In real-world deployments, LAPO post-training yields up to a 44% improvement over the initial warm-up policy across four complex tasks, including both single-arm and dual-arm settings. Finally, LaST-R1 demonstrates strong generalization across simulated and real-world environments.
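The abstract does not give LAPO's objective, so the following is only an illustrative sketch of what "jointly optimizing the latent reasoning process and the action generation" could look like. It assumes a factorized policy whose joint log-probability is the sum of per-step latent reasoning log-probabilities and the action log-probability, and it plugs that into a standard PPO-style clipped surrogate. The function names `joint_log_prob` and `clipped_surrogate`, the clipping value `eps`, and all numbers are hypothetical and not taken from the paper. A variable-length list of latent log-probabilities stands in for an adaptive reasoning horizon.

```python
import math
from typing import Sequence


def joint_log_prob(latent_log_probs: Sequence[float], action_log_prob: float) -> float:
    """Log-probability of a (latent chain, action) pair under an assumed
    factorized policy: sum of latent-step log-probs plus the action log-prob.
    The chain may have any length, mimicking an adaptive reasoning horizon."""
    return sum(latent_log_probs) + action_log_prob


def clipped_surrogate(new_lp: float, old_lp: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one sample (a value to be maximized).
    This is the generic PPO form, used here as a stand-in, not LAPO itself."""
    ratio = math.exp(new_lp - old_lp)                 # importance ratio pi_new / pi_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))   # clamp ratio to [1 - eps, 1 + eps]
    return min(ratio * advantage, clipped * advantage)


# Hypothetical numbers: a two-step latent chain followed by one action.
old_lp = joint_log_prob([-1.0, -0.5], -0.7)
new_lp = joint_log_prob([-0.9, -0.4], -0.6)
objective = clipped_surrogate(new_lp, old_lp, advantage=1.0)
```

Because the ratio is computed from the joint log-probability, a gradient on this objective would flow into both the latent reasoning steps and the action head, which is the general idea the abstract describes. How the paper actually weights or structures these terms is not stated here.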
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
reinforcement learning
latent reasoning
physical dynamics
adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Reasoning
Reinforcement Learning
Vision-Language-Action
Chain-of-Thought
Policy Optimization
Authors

Hao Chen
CUHK
Embodied AI, Multi-Modality Learning

Jiaming Liu
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

Zhonghao Yan
Beijing University of Posts and Telecommunications
Vision Language Model, Agent, Generative AI, Medical Image Analysis

Nuowei Han
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

Renrui Zhang
Seed ByteDance & MMLab & PKU
Large Multimodal Model, Generative Model, Embodied AI

Chenyang Gu
Undergraduate, Peking University
Embodied AI, Robotic Manipulation

Jialin Gao
National University of Singapore
Video Understanding, Multi-modal Understanding

Ziyu Guo
The Chinese University of Hong Kong
Multi-modality Learning, LLM/VLMs, 3D Vision

Siyuan Qian
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

Yinxi Wang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

Peng Jia
Simplexity Robotics

Chi-Wing Fu
The Chinese University of Hong Kong

Shanghang Zhang
Peking University
Embodied AI, Foundation Models

Pheng-Ann Heng
The Chinese University of Hong Kong