🤖 AI Summary
Existing robotic policy learning relies heavily on large-scale, action-annotated datasets, generalizes poorly, and struggles with cross-embodiment and cross-environment transfer.
Method: UniVLA introduces a task-centric, unified vision-language-action (VLA) framework that, for the first time, learns embodiment-agnostic, dynamics-invariant latent action representations from internet-scale videos. It leverages DINO-based visual features, integrates language instructions to suppress task-irrelevant motion, enables language-conditioned latent-space alignment, and employs a lightweight cross-embodiment action decoder. The framework supports continual learning from heterogeneous human demonstration videos.
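To make the idea concrete, a latent action model of this kind can be sketched as a vector-quantized inverse dynamics model over DINO features, conditioned on a language embedding. This is an illustrative approximation, not the authors' code; all names, dimensions, and the random "networks" below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (illustrative, not taken from the paper).
FEAT_DIM = 384      # pooled DINO feature size per frame
LANG_DIM = 64       # language-instruction embedding size
CODE_DIM = 32       # latent action dimension
NUM_CODES = 16      # codebook size (discrete latent action vocabulary)

# Random projections stand in for trained encoder weights.
W_enc = rng.standard_normal((2 * FEAT_DIM + LANG_DIM, CODE_DIM)) * 0.01
codebook = rng.standard_normal((NUM_CODES, CODE_DIM))

def encode_latent_action(feat_t, feat_t1, lang):
    """Map a (frame_t, frame_t+1, instruction) triple to a discrete
    latent action via nearest-neighbor codebook quantization.
    Conditioning on the instruction is what lets the model discount
    task-irrelevant motion between the two frames."""
    x = np.concatenate([feat_t, feat_t1, lang]) @ W_enc
    idx = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    return idx, codebook[idx]

# Toy usage: two consecutive DINO features plus an instruction embedding.
f_t = rng.standard_normal(FEAT_DIM)
f_t1 = rng.standard_normal(FEAT_DIM)
lang = rng.standard_normal(LANG_DIM)
idx, z = encode_latent_action(f_t, f_t1, lang)
```

Because the codebook is shared across all videos, the resulting discrete latent actions are embodiment-agnostic by construction, which is what allows pretraining on heterogeneous, action-free video.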
Contribution/Results: UniVLA achieves state-of-the-art performance on multi-task manipulation and navigation benchmarks, as well as on real-world robots. Compared to OpenVLA, it reduces pretraining compute by >95%, cuts downstream annotation requirements by 90%, and demonstrates consistent performance gains with increasing data diversity.
📝 Abstract
A generalist robot should perform effectively across various environments. However, most existing approaches rely heavily on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to a single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To address these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as in real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of the pretraining compute and 1/10 of the downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA's potential to facilitate scalable and efficient robot policy learning.
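The "efficient latent action decoding" step in the abstract amounts to training a small per-embodiment head that maps the shared latent action to concrete robot controls. The sketch below is a minimal illustration of that idea, assuming linear heads and made-up embodiment names and control dimensions; in practice each head would be trained on a small amount of action-annotated downstream data:

```python
import numpy as np

rng = np.random.default_rng(1)

CODE_DIM = 32  # shared latent action size (assumed)
# Per-embodiment control sizes (assumed): a 7-DoF arm and a 2-D mobile base.
ACTION_DIMS = {"arm7dof": 7, "mobile_base": 2}

# One lightweight head per embodiment; random weights stand in for
# weights fit on a small labeled dataset.
heads = {name: rng.standard_normal((CODE_DIM, d)) * 0.1
         for name, d in ACTION_DIMS.items()}

def decode(latent_action, embodiment):
    """Map a shared latent action to an embodiment-specific command."""
    return latent_action @ heads[embodiment]

# The same latent action decodes to different control spaces.
z = rng.standard_normal(CODE_DIM)
arm_cmd = decode(z, "arm7dof")       # 7-DoF joint/end-effector command
base_cmd = decode(z, "mobile_base")  # planar base velocity command
```

Keeping the generalist policy frozen and adapting only these small heads is what makes the reported 1/10 downstream-data requirement plausible: the hard perception-to-intent mapping is learned once, embodiment-agnostically, during pretraining.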