Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two problems in vision-language-action (VLA) models: depth ambiguity caused by monocular input, and inefficient action learning caused by regressing unstructured targets such as noise or velocity. For perception, the approach uses a pretrained multi-view diffusion model to synthesize latent novel views, then applies a Geometry-Guided Gated Transformer (G3T) to align and fuse multi-view features under 3D geometric guidance while filtering occlusion noise. For control, Action Manifold Learning (AML) predicts actions directly on the valid action manifold. On LIBERO, RoboTwin 2.0, and real-robot tasks, the method reports higher success rates and greater robustness than state-of-the-art baselines.
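The idea of gating multi-view features before fusing them can be illustrated with a small NumPy sketch. This is not the paper's G3T: the function name `gated_fuse`, the parameters `w_gate` and `b_gate`, the tensor shapes, and the choice of a sigmoid gate with mean pooling over views are all assumptions made for illustration, and the 3D geometric guidance is omitted entirely. It only shows how a learned gate could suppress unreliable (for example, occluded) features from synthesized views before they are added to the observed view.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fuse(anchor, views, w_gate, b_gate):
    """Fuse synthesized-view features into the observed (anchor) view.

    Hypothetical shapes (assumptions, not from the paper):
      anchor: (N, D)    tokens from the real camera view
      views:  (V, N, D) tokens from V synthesized latent views
      w_gate: (2 * D, D), b_gate: (D,)  gate parameters
    """
    # Pair every synthesized token with the matching anchor token.
    anchor_rep = np.broadcast_to(anchor, views.shape)        # (V, N, D)
    pair = np.concatenate([anchor_rep, views], axis=-1)      # (V, N, 2D)

    # Gate in (0, 1): values near 0 suppress a view's feature channel.
    gate = sigmoid(pair @ w_gate + b_gate)                   # (V, N, D)

    # Residual fusion: anchor plus the mean of the gated view features.
    return anchor + (gate * views).mean(axis=0)              # (N, D)
```

With a strongly negative gate bias the synthesized views contribute almost nothing and the output stays close to the anchor features, which is the behavior one would want for heavily occluded views.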
📝 Abstract
This paper tackles spatial perception and manipulation challenges in Vision-Language-Action (VLA) models. To address depth ambiguity from monocular input, we leverage a pre-trained multi-view diffusion model to synthesize latent novel views and propose a Geometry-Guided Gated Transformer (G3T) that aligns multi-view features under 3D geometric guidance while adaptively filtering occlusion noise. To improve action learning efficiency, we introduce Action Manifold Learning (AML), which directly predicts actions on the valid action manifold, bypassing inefficient regression of unstructured targets like noise or velocity. Experiments on LIBERO, RoboTwin 2.0, and real-robot tasks show our method achieves superior success rate and robustness over SOTA baselines. Project page: https://junjxiao.github.io/Multi-view-VLA.github.io/.
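The contrast between predicting actions directly and regressing noise or velocity can be made concrete with a small sketch. It assumes a linear interpolation path between a noise sample and a ground-truth action, a common convention in flow-matching policies; the abstract does not state which path or loss AML uses, so the function names and the path itself are illustrative assumptions, not the paper's formulation.

```python
import numpy as np


def interpolate(action, noise, t):
    """Noisy sample on an assumed linear path: t=0 is noise, t=1 is the action."""
    return (1.0 - t) * noise + t * action


def regression_targets(action, noise):
    """Three possible training targets for a network that sees x_t and t."""
    return {
        "noise": noise,                # noise prediction
        "velocity": action - noise,    # velocity prediction (d x_t / d t)
        "action": action,              # direct action prediction
    }


def action_from_velocity(x_t, velocity, t):
    """Recover the action implied by a velocity prediction on the linear path."""
    return x_t + (1.0 - t) * velocity
```

Under this assumed path the three targets carry the same information, since the action can be recovered from `x_t` and the velocity. Only the "action" target is itself a valid action, while the noise and velocity targets are unstructured quantities, which is the distinction the abstract draws.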
Problem

Research questions and friction points this paper is trying to address.

spatial perception
depth ambiguity
action learning efficiency
robotic manipulation
Vision-Language-Action
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view Diffusion
Geometry-Guided Gated Transformer
Action Manifold Learning
Vision-Language-Action Models
3D Geometric Alignment