MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of learning latent action representations that are highly correlated with ground-truth actions from multi-view human videos without action labels, in order to facilitate pretraining of vision-language-action (VLA) models. To this end, the authors propose a cross-view reconstruction mechanism: discrete latent actions inferred from one temporally synchronized view must enable reconstruction of future states in another view. This enforces invariance to view-specific visual cues and makes the learned actions more semantically meaningful. The approach significantly increases the mutual information between latent and ground-truth actions while remaining robust in out-of-distribution scenarios. Experiments on Bridge V2 demonstrate that the learned latent actions are more predictive of ground-truth actions and improve downstream robotic manipulation performance on the SIMPLER and LIBERO-Long benchmarks.
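The cross-view reconstruction idea described above can be sketched as a simple loss computation. The following is a minimal illustrative sketch, not the authors' implementation: all shapes, the linear encoder/decoder, and the nearest-neighbor vector quantization are assumptions chosen to show the data flow (infer a discrete latent action from view A's transition, then reconstruct view B's future frame from it).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: flattened frame features, codebook size, action dim.
D, K, A = 32, 8, 4
codebook = rng.normal(size=(K, A))          # discrete latent-action codes
W_enc = rng.normal(size=(2 * D, A)) * 0.1   # encoder: (frame_t, frame_t1) -> action
W_dec = rng.normal(size=(D + A, D)) * 0.1   # decoder: (frame_t, action) -> frame_t1

def cross_view_reconstruction_loss(view_a_t, view_a_t1, view_b_t, view_b_t1):
    """Infer a discrete latent action from the transition in view A, then use
    it to reconstruct the future frame of the *other* view B, so the action
    cannot rely on viewpoint-specific cues."""
    # 1) Encode view A's transition into a continuous action embedding.
    z = np.concatenate([view_a_t, view_a_t1]) @ W_enc
    # 2) Vector-quantize: snap to the nearest codebook entry (discrete action).
    idx = int(np.argmin(((codebook - z) ** 2).sum(axis=1)))
    z_q = codebook[idx]
    # 3) Decode view B's future frame from its current frame plus the action.
    pred_b_t1 = np.concatenate([view_b_t, z_q]) @ W_dec
    # 4) Reconstruction error in view B is the training signal.
    return float(((pred_b_t1 - view_b_t1) ** 2).mean()), idx

# Toy synchronized frames: view_a_t, view_a_t1, view_b_t, view_b_t1.
frames = rng.normal(size=(4, D))
loss, action_idx = cross_view_reconstruction_loss(*frames)
print(f"latent action index: {action_idx}, loss: {loss:.3f}")
```

In a real latent action model the encoder and decoder would be deep networks trained end-to-end (with a straight-through estimator for the quantization step), but the cross-view pairing shown here is the core of the objective.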

📝 Abstract
Learning latent actions from diverse human videos enables scaling robot learning beyond embodiment-specific robot datasets, and these latent actions have recently been used as pseudo-action labels for vision-language-action (VLA) model pretraining. To make VLA pretraining effective, latent actions should contain information about the underlying agent's actions despite the absence of ground-truth labels. We propose the Multi-ViewPoint Latent Action Model (MVP-LAM), which learns discrete latent actions that are highly informative about ground-truth actions from time-synchronized multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action inferred from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on the SIMPLER and LIBERO-Long benchmarks.
Problem

Research questions and friction points this paper is trying to address.

latent actions
vision-language-action models
multi-view videos
action representation
robot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent actions
cross-viewpoint reconstruction
multi-view video
vision-language-action models
action-centric representation
Jung Min Lee
Seoul National University, Seoul, South Korea
Dohyeok Lee
Seoul National University, Seoul, South Korea
Seokhun Ju
Seoul National University, Seoul, South Korea
Taehyun Cho
Seoul National University
Reinforcement Learning
Jin Woo Koo
Seoul National University, Seoul, South Korea
Li Zhao
Microsoft Research Asia, Beijing, China
Sangwoo Hong
Konkuk University, Seoul, South Korea
Jungwoo Lee
Professor, Department of Electrical and Computer Engineering, Seoul National University
Machine Learning · Distributed Computing · Information Theory