MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of learning latent action representations that are highly correlated with ground-truth actions from multi-view human videos without action labels, in order to facilitate pretraining of vision-language-action (VLA) models. To this end, the authors propose a cross-view reconstruction mechanism: discrete latent actions inferred from one temporally synchronized view must enable reconstruction of future states in another view. This enforces invariance to view-specific visual cues and makes the learned actions more semantically meaningful. The approach significantly increases the mutual information between latent and ground-truth actions while remaining robust in out-of-distribution scenarios. Experiments on Bridge V2 demonstrate that the learned latent actions are more predictive of ground-truth actions and improve downstream robotic manipulation performance on the SIMPLER and LIBERO-Long benchmarks.
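The cross-view reconstruction idea described above can be sketched as a simple loss computation. The following is a minimal illustrative sketch, not the authors' implementation: all shapes, the linear encoder/decoder, and the nearest-neighbor vector quantization are assumptions chosen to show the data flow (infer a discrete latent action from view A's transition, then reconstruct view B's future frame from it).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: flattened frame features, codebook size, action dim.
D, K, A = 32, 8, 4
codebook = rng.normal(size=(K, A))          # discrete latent-action codes
W_enc = rng.normal(size=(2 * D, A)) * 0.1   # encoder: (frame_t, frame_t1) -> action
W_dec = rng.normal(size=(D + A, D)) * 0.1   # decoder: (frame_t, action) -> frame_t1

def cross_view_reconstruction_loss(view_a_t, view_a_t1, view_b_t, view_b_t1):
    """Infer a discrete latent action from the transition in view A, then use
    it to reconstruct the future frame of the *other* view B, so the action
    cannot rely on viewpoint-specific cues."""
    # 1) Encode view A's transition into a continuous action embedding.
    z = np.concatenate([view_a_t, view_a_t1]) @ W_enc
    # 2) Vector-quantize: snap to the nearest codebook entry (discrete action).
    idx = int(np.argmin(((codebook - z) ** 2).sum(axis=1)))
    z_q = codebook[idx]
    # 3) Decode view B's future frame from its current frame plus the action.
    pred_b_t1 = np.concatenate([view_b_t, z_q]) @ W_dec
    # 4) Reconstruction error in view B is the training signal.
    return float(((pred_b_t1 - view_b_t1) ** 2).mean()), idx

# Toy synchronized frames: view_a_t, view_a_t1, view_b_t, view_b_t1.
frames = rng.normal(size=(4, D))
loss, action_idx = cross_view_reconstruction_loss(*frames)
print(f"latent action index: {action_idx}, loss: {loss:.3f}")
```

In a real latent action model the encoder and decoder would be deep networks trained end-to-end (with a straight-through estimator for the quantization step), but the cross-view pairing shown here is the core of the objective.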

📝 Abstract
Learning latent actions from diverse human videos enables scaling robot learning beyond embodiment-specific robot datasets, and these latent actions have recently been used as pseudo-action labels for vision-language-action (VLA) model pretraining. To make VLA pretraining effective, latent actions should contain information about the underlying agent's actions despite the absence of ground-truth labels. We propose the Multi-ViewPoint Latent Action Model (MVP-LAM), which learns discrete latent actions that are highly informative about ground-truth actions from time-synchronized multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action inferred from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on the SIMPLER and LIBERO-Long benchmarks.
Problem

Research questions and friction points this paper is trying to address.

latent actions
vision-language-action models
multi-view videos
action representation
robot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent actions
cross-viewpoint reconstruction
multi-view video
vision-language-action models
action-centric representation
Jung Min Lee
Seoul National University, Seoul, South Korea
Dohyeok Lee
Seoul National University, Seoul, South Korea
Seokhun Ju
Seoul National University, Seoul, South Korea
Taehyun Cho
Seoul National University
Reinforcement Learning
Jin Woo Koo
Seoul National University, Seoul, South Korea
Li Zhao
Microsoft Research Asia, Beijing, China
Sangwoo Hong
Konkuk University, Seoul, South Korea
Jungwoo Lee
Professor, Department of Electrical and Computer Engineering, Seoul National University
Machine Learning · Distributed Computing · Information Theory