VITA: Vision-to-Action Flow Matching Policy

πŸ“… 2025-07-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing vision-to-action generation methods rely on Gaussian noise priors and conditioning mechanisms (e.g., cross-attention), incurring high computational overhead and struggling to model the manifold mapping between visual inputs and sparse, unstructured action sequences. Method: We propose VITA, a novel end-to-end flow matching framework that uses image latent variables as the source distribution. It introduces a purely MLP-based architecture for unconditional vision-to-action mapping, unifying cross-modal manifold learning. A structured action latent space is constructed via an autoencoder and aligned with visual representations through upsampling; actions are recovered by decoding the flow latent after ODE solving, under reconstruction supervision. Results: Evaluated on the ALOHA platform, VITA matches or exceeds state-of-the-art performance across five simulated and two real-world dual-arm robotic tasks, while reducing inference latency by 50–130%.

πŸ“ Abstract
We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.
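The core training idea in the abstract, using the image latent itself as the flow source rather than Gaussian noise, can be sketched as a standard linear-interpolant flow matching loss with an MLP velocity field. This is a minimal illustration, not the paper's implementation: the network sizes, latent dimension (256), and time-conditioning scheme are all assumptions.

```python
import torch
import torch.nn as nn

class VelocityMLP(nn.Module):
    """Hypothetical MLP velocity field v_theta(x_t, t); note there is no
    cross-attention or other conditioning module, since the image latent
    enters as the flow source itself."""
    def __init__(self, dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # append the scalar time t as one extra input feature
        return self.net(torch.cat([x, t[:, None]], dim=-1))

def flow_matching_loss(model, z_img, z_act):
    """Linear-interpolant flow matching with source = image latent and
    target = action latent (instead of the usual Gaussian source)."""
    B = z_img.shape[0]
    t = torch.rand(B, device=z_img.device)
    x_t = (1 - t[:, None]) * z_img + t[:, None] * z_act  # point on the straight path
    v_target = z_act - z_img                              # constant velocity of that path
    return ((model(x_t, t) - v_target) ** 2).mean()

# toy usage with random stand-ins for the two latents
model = VelocityMLP(dim=256)
z_img, z_act = torch.randn(8, 256), torch.randn(8, 256)
loss = flow_matching_loss(model, z_img, z_act)
loss.backward()
```

The abstract additionally supervises the decoded actions by backpropagating a reconstruction loss through the ODE solving steps; that second loss term is omitted here for brevity.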
Problem

Research questions and friction points this paper is trying to address.

How to evolve latent visual representations directly into actions for visuomotor control
How to eliminate the separate conditioning modules (e.g., cross-attention) used in action generation
How to bridge the dimensional mismatch between high-dimensional visual representations and sparse raw actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Treats latent images as the flow source in a vision-to-action flow matching policy
Learns an inherent mapping from latent images to latent actions with MLP-only layers
Employs an autoencoder to build a structured action latent space as the flow target
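The autoencoder contribution, lifting sparse raw actions into a structured latent whose shape matches the visual representation, might look like the following. All sizes here are assumptions for illustration (act_dim=14 for a bi-manual arm pair, horizon=16, latent dim 256 matching the visual latent), not the paper's configuration.

```python
import torch
import torch.nn as nn

class ActionAutoencoder(nn.Module):
    """Hypothetical autoencoder that up-samples a raw action chunk into a
    structured latent matching the visual latent's dimensionality, and
    decodes it back for reconstruction supervision."""
    def __init__(self, act_dim=14, horizon=16, latent_dim=256):
        super().__init__()
        flat = act_dim * horizon
        self.encoder = nn.Sequential(
            nn.Linear(flat, 512), nn.GELU(), nn.Linear(512, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.GELU(), nn.Linear(512, flat),
        )
        self.act_dim, self.horizon = act_dim, horizon

    def forward(self, actions):
        z = self.encoder(actions.flatten(1))   # structured action latent (flow target)
        recon = self.decoder(z)                # decoded actions for the recon loss
        return z, recon.view(-1, self.horizon, self.act_dim)

# toy usage: an 8-sample batch of 16-step, 14-dim action chunks
ae = ActionAutoencoder()
a = torch.randn(8, 16, 14)
z, recon = ae(a)
rec_loss = ((recon - a) ** 2).mean()
```

During policy training, `z` would serve as the flow matching target, and the decoder would turn solved flow latents back into executable actions.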
πŸ”Ž Similar Papers
No similar papers found.