LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models struggle to explicitly model complex 3D physical interactions and generalize poorly under unfamiliar spatial dynamics. To address this, the LaMP framework is the first to introduce single-step partial denoising of 3D scene flow as a latent motion prior for VLA, circumventing the need for full multi-step reconstruction. LaMP employs a dual-expert architecture, comprising a Motion Expert and an Action Expert, fused via a gated cross-attention mechanism to effectively integrate motion and action information. The method achieves state-of-the-art performance across the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, including a 9.7% average improvement in success rate over the strongest baseline under out-of-distribution perturbations in LIBERO-Plus.
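The summary's key idea, taking one partial denoising step instead of running the full multi-step reconstruction, can be illustrated with a toy flow-matching sketch. This is not the paper's implementation: `velocity_field` stands in for the learned network, the straight-line target and all values are assumptions for illustration only.

```python
import numpy as np

def velocity_field(x, t):
    """Stand-in for a learned flow-matching velocity v_theta(x, t).
    For straight-line paths from noise x0 to a target x1, the velocity
    is x1 - x0, recoverable from the current state as (x1 - x) / (1 - t)."""
    x1 = np.ones_like(x)  # hypothetical 'clean' 3D scene flow target
    return (x1 - x) / (1.0 - t)

def full_denoise(x0, steps=100):
    """Full multi-step reconstruction: Euler-integrate the ODE from t=0 to t=1."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_field(x, i * dt)
    return x

def partial_denoise_one_step(x0, dt=0.3):
    """Single-step partial denoising: one Euler step of size dt. The
    intermediate state (or the network's hidden features at this step)
    can serve as a motion prior, skipping full reconstruction."""
    return x0 + dt * velocity_field(x0, 0.0)

x0 = np.zeros(3)                      # noise sample (illustrative)
print(full_denoise(x0))               # reaches the target [1. 1. 1.]
print(partial_denoise_one_step(x0))   # partially denoised: [0.3 0.3 0.3]
```

The single step is far cheaper than the 100-step integration yet already points toward the target, which is the intuition behind using its hidden states as a prior rather than the fully reconstructed flow.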

📝 Abstract
We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly; this implicit strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as in real-world experiments. LaMP consistently outperforms the evaluated VLA baselines across all three benchmarks, achieving the highest reported average success rates under the same training budgets. Under LIBERO-Plus out-of-distribution perturbations, LaMP shows improved robustness, with an average 9.7% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.
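The abstract describes fusing the two experts through gated cross-attention, where the Action Expert's tokens attend to the Motion Expert's hidden states. A minimal NumPy sketch of one plausible form of this mechanism follows; the tanh gate with zero initialization (in the style of Flamingo-type gated layers), the single head, and all shapes are assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(action_h, motion_h, Wq, Wk, Wv, gate):
    """Action-expert tokens (queries) attend to motion-expert tokens
    (keys/values); a tanh gate scales the attended update before the
    residual add, letting the policy smoothly ignore or use motion cues."""
    q = action_h @ Wq                                 # (Ta, d)
    k = motion_h @ Wk                                 # (Tm, d)
    v = motion_h @ Wv                                 # (Tm, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (Ta, Tm)
    return action_h + np.tanh(gate) * (attn @ v)      # gated residual

rng = np.random.default_rng(0)
d, Ta, Tm = 8, 4, 6
action_h = rng.standard_normal((Ta, d))   # Action Expert hidden states
motion_h = rng.standard_normal((Tm, d))   # Motion Expert hidden states
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = gated_cross_attention(action_h, motion_h, Wq, Wk, Wv, gate=0.0)
# With gate = 0, tanh(0) = 0 and the block reduces to the identity,
# so training can start from the unconditioned action pathway.
assert np.allclose(out, action_h)
```

The zero-initialized gate is a common choice when injecting a new conditioning stream into a pretrained backbone: the fused model starts out behaving exactly like the original policy and gradually learns how much motion information to admit.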
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
3D scene flow
robotic manipulation
spatial dynamics
implicit learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D scene flow
vision-language-action
latent motion prior
gated cross-attention
robotic manipulation
Xinkai Wang
Southeast University
Embodied AI, LLM reasoning
Chenyi Wang
Zhejiang University
Yifu Xu
School of Artificial Intelligence, Shanghai Jiao Tong University
Mingzhe Ye
School of Artificial Intelligence, Shanghai Jiao Tong University
Fu-Cheng Zhang
Beihang University
Jialin Tian
School of Artificial Intelligence, Shanghai Jiao Tong University
Xinyu Zhan
Shanghai Jiao Tong University
Lifeng Zhu
Southeast University
Cewu Lu
School of Artificial Intelligence, Shanghai Jiao Tong University
Lixin Yang
School of Artificial Intelligence, Shanghai Jiao Tong University