🤖 AI Summary
To address the poor out-of-distribution generalization of existing Vision-Language-Action (VLA) models, this paper introduces ViVLA, a general robotic manipulation policy framework that generalizes from a single demonstration video at test time. Methodologically, it proposes a unified vision-language-action joint modeling architecture together with an automated expert-agent paired data synthesis pipeline that converts easily collected human demonstration videos into high-quality action trajectories, and it integrates diverse publicly available datasets for training. The core contribution is test-time generalization: given only a single demonstration video of an unseen task, ViVLA adapts directly to novel tasks and previously unencountered robot morphologies without fine-tuning. Experiments show substantial improvements: +30.2% success rate on unseen LIBERO tasks, +35.7% in cross-morphology video-based transfer, and +38.4% success rate on real-world unseen tasks, significantly advancing open-world adaptability in embodied intelligence.
📝 Abstract
Developing robust, general-purpose manipulation policies is a fundamental objective in robotics research. While Vision-Language-Action (VLA) models have demonstrated promising capabilities for end-to-end robot control, existing approaches still generalize poorly to tasks beyond their training distributions. Humans, in contrast, are remarkably adept at acquiring novel skills after observing others perform them just once. Inspired by this capability, we propose ViVLA, a generalist robotic manipulation policy that learns new tasks efficiently from a single expert demonstration video at test time. Our approach jointly processes an expert demonstration video and the robot's visual observations to predict both the demonstrated action sequence and the robot's subsequent actions, distilling fine-grained manipulation knowledge from expert behavior and transferring it to the agent. To strengthen ViVLA, we develop a scalable expert-agent pair data generation pipeline that synthesizes paired trajectories from easily accessible human videos, augmented with curated pairs from publicly available datasets; in total, the pipeline produces 892,911 expert-agent samples for training. Experiments show that ViVLA acquires novel manipulation skills from only a single expert demonstration video at test time, achieving over 30% improvement on unseen LIBERO tasks and above 35% gains with cross-embodiment videos. Real-world experiments demonstrate effective learning from human videos, yielding more than 38% improvement on unseen tasks.
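The dual-prediction idea in the abstract — one model consuming an expert demonstration video plus the robot's current observation and decoding both the demonstrated action sequence and the robot's next actions — can be sketched in miniature. This is a purely illustrative toy, not the paper's architecture: all dimensions, the pooled linear "encoders", and the two action heads are placeholder assumptions standing in for the actual VLA backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not the paper's actual sizes).
FRAME_DIM, OBS_DIM, HIDDEN, ACT_DIM, HORIZON = 64, 64, 128, 7, 8

# Hypothetical linear encoders and action heads (random placeholders).
W_frame = rng.normal(0, 0.1, (FRAME_DIM, HIDDEN))
W_obs = rng.normal(0, 0.1, (OBS_DIM, HIDDEN))
W_expert_head = rng.normal(0, 0.1, (HIDDEN, HORIZON * ACT_DIM))
W_agent_head = rng.normal(0, 0.1, (HIDDEN, HORIZON * ACT_DIM))

def vivla_step(expert_video, agent_obs):
    """Jointly encode the expert demo video and the robot's observation,
    then decode BOTH the demonstrated action sequence and the robot's
    next actions -- the dual objective described in the abstract."""
    # Mean-pool the encoded video frames into a single demo context.
    expert_ctx = np.tanh(expert_video @ W_frame).mean(axis=0)
    agent_ctx = np.tanh(agent_obs @ W_obs)
    joint = expert_ctx + agent_ctx  # fused expert-agent context
    expert_actions = (joint @ W_expert_head).reshape(HORIZON, ACT_DIM)
    agent_actions = (joint @ W_agent_head).reshape(HORIZON, ACT_DIM)
    return expert_actions, agent_actions

video = rng.normal(size=(16, FRAME_DIM))  # 16 frames of one demo
obs = rng.normal(size=(OBS_DIM,))
exp_a, agt_a = vivla_step(video, obs)
print(exp_a.shape, agt_a.shape)  # (8, 7) (8, 7)
```

The point of the sketch is the shared context: because one fused representation drives both heads, supervision on the expert's actions shapes the same features the agent head uses, which is how knowledge from the video can transfer to the robot's own action prediction.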