🤖 AI Summary
To address the prohibitive cost of collecting real-world demonstration data for learning complex robotic manipulation skills, this paper proposes a cross-modal transfer framework based on Semantic Action Flow (SAF). SAF models the semantic structure of human–object interactions, capturing their essential spatio-temporal dynamics while remaining invariant to superficial visual differences. The framework uses self-supervised learning to extract manipulation priors from large-scale unlabeled human videos, mapping them into structured action flows without manual annotation. Only a small number (1–5) of robot demonstrations are then required for fine-tuning and adaptation. Crucially, SAF serves as a novel intermediate representation bridging human video observation and robotic execution, enabling structured, generalizable cross-modal transfer. The method achieves state-of-the-art performance on both the CALVIN benchmark and real-robot manipulation tasks, significantly outperforming existing few-shot imitation learning approaches.
📝 Abstract
One of the central challenges preventing robots from acquiring complex manipulation skills is the prohibitive cost of collecting large-scale robot demonstrations. In contrast, humans learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation that captures the essential spatio-temporal manipulator-object interactions while remaining invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation in a self-supervised manner from large-scale unlabeled video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction videos, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline. Extensive experiments on the CALVIN benchmark and real-world tasks show that ViSA-Flow achieves state-of-the-art performance, particularly in low-data regimes, outperforming prior methods by effectively transferring knowledge from human video observation to robotic execution. Videos are available at https://visaflow-web.github.io/ViSAFLOW.
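The two-stage recipe (pre-train a prior on flows extracted from human video, then adapt it with a handful of robot demos passed through the same abstraction) can be sketched in miniature. This is a heavily simplified toy, not the paper's implementation: the flow extractor is a stand-in (frame differences), the "generative prior" is replaced by ridge regression, and all data is synthetic.

```python
# Toy sketch of the ViSA-Flow two-stage recipe. All names, shapes, and
# models here are illustrative assumptions, not the paper's actual method.
import numpy as np

rng = np.random.default_rng(0)

def extract_semantic_action_flow(frames):
    """Stand-in for the semantic abstraction pipeline: map raw frame
    features to a compact flow feature (here, frame-to-frame differences)."""
    return np.diff(frames, axis=0)

def fit_linear_model(flows, actions, l2=1e-3):
    """Ridge regression as a stand-in for learning the manipulation prior."""
    X = np.vstack(flows)
    Y = np.vstack(actions)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ Y)

# Stage 1: "pre-train" on many human interaction sequences, using only
# the abstracted flows (no robot data involved).
true_W = rng.normal(size=(4, 2))          # synthetic ground-truth mapping
human_flows, human_actions = [], []
for _ in range(200):
    frames = rng.normal(size=(6, 4))      # a short synthetic "video"
    flow = extract_semantic_action_flow(frames)
    human_flows.append(flow)
    human_actions.append(flow @ true_W)
W_prior = fit_linear_model(human_flows, human_actions)

# Stage 2: adapt with a single robot demo, processed through the *same*
# abstraction, warm-started from the prior (fit only the residual).
robot_frames = rng.normal(size=(6, 4))
robot_flow = extract_semantic_action_flow(robot_frames)
robot_actions = robot_flow @ true_W
W_adapted = W_prior + fit_linear_model(
    [robot_flow], [robot_actions - robot_flow @ W_prior], l2=1.0)

pred = robot_flow @ W_adapted
```

The point of the sketch is structural: both stages consume the same intermediate representation, so knowledge learned from one embodiment (here, the prior `W_prior`) transfers to the other with only a small residual correction.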