LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models struggle to explicitly model complex 3D physical interactions and generalize poorly under unfamiliar spatial dynamics. To address this, the LaMP framework is the first to introduce single-step partial denoising of 3D scene flow as a latent motion prior for VLA, circumventing the need for full multi-step reconstruction. LaMP employs a dual-expert architecture, comprising a Motion Expert and an Action Expert, fused via a gated cross-attention mechanism to effectively integrate motion and action information. The method achieves state-of-the-art performance across the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX benchmarks, including a 9.7% average improvement in success rate over the strongest baseline under out-of-distribution perturbations in LIBERO-Plus.
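The summary's key idea, taking one partial denoising step instead of running the full multi-step reconstruction, can be illustrated with a toy flow-matching sketch. This is not the paper's implementation: `velocity_field` stands in for the learned network, the straight-line target and all values are assumptions for illustration only.

```python
import numpy as np

def velocity_field(x, t):
    """Stand-in for a learned flow-matching velocity v_theta(x, t).
    For straight-line paths from noise x0 to a target x1, the velocity
    is x1 - x0, recoverable from the current state as (x1 - x) / (1 - t)."""
    x1 = np.ones_like(x)  # hypothetical 'clean' 3D scene flow target
    return (x1 - x) / (1.0 - t)

def full_denoise(x0, steps=100):
    """Full multi-step reconstruction: Euler-integrate the ODE from t=0 to t=1."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_field(x, i * dt)
    return x

def partial_denoise_one_step(x0, dt=0.3):
    """Single-step partial denoising: one Euler step of size dt. The
    intermediate state (or the network's hidden features at this step)
    can serve as a motion prior, skipping full reconstruction."""
    return x0 + dt * velocity_field(x0, 0.0)

x0 = np.zeros(3)                      # noise sample (illustrative)
print(full_denoise(x0))               # reaches the target [1. 1. 1.]
print(partial_denoise_one_step(x0))   # partially denoised: [0.3 0.3 0.3]
```

The single step is far cheaper than the 100-step integration yet already points toward the target, which is the intuition behind using its hidden states as a prior rather than the fully reconstructed flow.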

📝 Abstract
We introduce LaMP, a dual-expert Vision-Language-Action framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, forcing them to learn complex 3D physical interactions implicitly; this implicit strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching Motion Expert with a policy-predicting Action Expert through gated cross-attention. Specifically, the Motion Expert generates a one-step partially denoised 3D scene flow, and its hidden states condition the Action Expert without full multi-step reconstruction. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as in real-world experiments. LaMP consistently outperforms the evaluated VLA baselines across all three benchmarks, achieving the highest reported average success rates under the same training budgets. Under LIBERO-Plus out-of-distribution perturbations, LaMP shows improved robustness, with an average 9.7% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.
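The abstract describes fusing the two experts through gated cross-attention, where the Action Expert's tokens attend to the Motion Expert's hidden states. A minimal NumPy sketch of one plausible form of this mechanism follows; the tanh gate with zero initialization (in the style of Flamingo-type gated layers), the single head, and all shapes are assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(action_h, motion_h, Wq, Wk, Wv, gate):
    """Action-expert tokens (queries) attend to motion-expert tokens
    (keys/values); a tanh gate scales the attended update before the
    residual add, letting the policy smoothly ignore or use motion cues."""
    q = action_h @ Wq                                 # (Ta, d)
    k = motion_h @ Wk                                 # (Tm, d)
    v = motion_h @ Wv                                 # (Tm, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (Ta, Tm)
    return action_h + np.tanh(gate) * (attn @ v)      # gated residual

rng = np.random.default_rng(0)
d, Ta, Tm = 8, 4, 6
action_h = rng.standard_normal((Ta, d))   # Action Expert hidden states
motion_h = rng.standard_normal((Tm, d))   # Motion Expert hidden states
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = gated_cross_attention(action_h, motion_h, Wq, Wk, Wv, gate=0.0)
# With gate = 0, tanh(0) = 0 and the block reduces to the identity,
# so training can start from the unconditioned action pathway.
assert np.allclose(out, action_h)
```

The zero-initialized gate is a common choice when injecting a new conditioning stream into a pretrained backbone: the fused model starts out behaving exactly like the original policy and gradually learns how much motion information to admit.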
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
3D scene flow
robotic manipulation
spatial dynamics
implicit learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D scene flow
vision-language-action
latent motion prior
gated cross-attention
robotic manipulation
Xinkai Wang
Southeast University
Embodied AI, LLM reasoning
Chenyi Wang
Zhejiang University
Yifu Xu
School of Artificial Intelligence, Shanghai Jiao Tong University
Mingzhe Ye
School of Artificial Intelligence, Shanghai Jiao Tong University
Fu-Cheng Zhang
Beihang University
Jialin Tian
School of Artificial Intelligence, Shanghai Jiao Tong University
Xinyu Zhan
Shanghai Jiao Tong University
Lifeng Zhu
Southeast University
Cewu Lu
School of Artificial Intelligence, Shanghai Jiao Tong University
Lixin Yang
School of Artificial Intelligence, Shanghai Jiao Tong University