From Watch to Imagine: Steering Long-horizon Manipulation via Human Demonstration and Future Envisionment

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Zero-shot generalization to long-horizon robotic manipulation tasks remains a fundamental challenge, as existing methods struggle to decompose high-level instructions into executable action sequences from static visual inputs alone. This paper introduces Super-Mimic, a hierarchical framework that couples video-driven multimodal intent parsing with prospective future dynamics prediction. A multimodal reasoning model parses unscripted human demonstration videos to infer procedural intent as a sequence of language-grounded subtasks; each subtask conditions a generative model that synthesizes a physically plausible video rollout for that step; and the resulting dynamics-aware visual trajectories guide the low-level controller. Evaluated on multiple long-horizon manipulation benchmarks, Super-Mimic improves zero-shot performance over prior state-of-the-art methods by more than 20%, advancing task generalization and autonomous hierarchical decomposition in general-purpose robotic systems.

📝 Abstract
Generalizing to long-horizon manipulation tasks in a zero-shot setting remains a central challenge in robotics. Current approaches based on multimodal foundation models, despite their capabilities, typically fail to decompose high-level commands into executable action sequences from static visual input alone. To address this challenge, we introduce Super-Mimic, a hierarchical framework that enables zero-shot robotic imitation by directly inferring procedural intent from unscripted human demonstration videos. Our framework is composed of two sequential modules. First, a Human Intent Translator (HIT) parses the input video using multimodal reasoning to produce a sequence of language-grounded subtasks. These subtasks then condition a Future Dynamics Predictor (FDP), which employs a generative model that synthesizes a physically plausible video rollout for each step. The resulting visual trajectories are dynamics-aware, explicitly modeling crucial object interactions and contact points to guide the low-level controller. We validate this approach through extensive experiments on a suite of long-horizon manipulation tasks, where Super-Mimic significantly outperforms state-of-the-art zero-shot methods by over 20%. These results establish that coupling video-driven intent parsing with prospective dynamics modeling is a highly effective strategy for developing general-purpose robotic systems.
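To make the data flow concrete, below is a minimal structural sketch of the HIT → FDP → controller hierarchy described in the abstract. All class and function names here are illustrative assumptions, not the paper's actual API; the real HIT and FDP are learned multimodal and generative video models, which are replaced by trivial placeholders purely to show how subtasks condition rollouts that then drive the low-level controller.

```python
# Structural sketch of a Super-Mimic-style hierarchy (illustrative only).
# Class names, method signatures, and the placeholder logic are assumptions,
# not the paper's implementation.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Subtask:
    """A language-grounded subtask inferred from the demonstration video."""
    description: str   # e.g. "open the drawer"
    target_object: str  # object manipulated in this step


class HumanIntentTranslator:
    """Stand-in for the HIT module: demonstration video -> ordered subtask list."""

    def parse(self, demo_video: np.ndarray) -> List[Subtask]:
        # A real system would run a multimodal reasoning model here;
        # a fixed plan is returned purely to show the data flow.
        return [
            Subtask("open the drawer", "drawer"),
            Subtask("pick up the cup", "cup"),
            Subtask("place the cup in the drawer", "cup"),
        ]


class FutureDynamicsPredictor:
    """Stand-in for the FDP module: subtask + current frame -> video rollout."""

    def rollout(self, frame: np.ndarray, subtask: Subtask, horizon: int = 8) -> np.ndarray:
        # A generative video model would synthesize a physically plausible rollout
        # conditioned on the subtask; repeating the frame is only a placeholder.
        return np.stack([frame] * horizon)


class LowLevelController:
    """Tracks the predicted visual trajectory with robot actions."""

    def execute(self, rollout: np.ndarray) -> None:
        print(f"executing a {len(rollout)}-step visual trajectory")


def run_super_mimic(demo_video: np.ndarray, current_frame: np.ndarray) -> None:
    hit, fdp, ctrl = HumanIntentTranslator(), FutureDynamicsPredictor(), LowLevelController()
    for subtask in hit.parse(demo_video):
        print(f"subtask: {subtask.description}")
        ctrl.execute(fdp.rollout(current_frame, subtask))


if __name__ == "__main__":
    dummy_video = np.zeros((16, 64, 64, 3), dtype=np.uint8)  # placeholder demo video
    dummy_frame = np.zeros((64, 64, 3), dtype=np.uint8)      # placeholder observation
    run_super_mimic(dummy_video, dummy_frame)
```

The design choice the sketch highlights is the strict ordering of the two modules: subtasks are produced once from the demonstration, and each one independently conditions a forward rollout that the controller tracks.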
Problem

Research questions and friction points this paper is trying to address.

Zero-shot generalization for long-horizon robotic manipulation tasks
Decomposing high-level commands into executable action sequences
Inferring procedural intent from unscripted human demonstration videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical framework for zero-shot robotic imitation
Translates human videos into language-grounded subtasks
Generates dynamics-aware visual trajectories for control
Ke Ye
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Jiaming Zhou
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Yuanfeng Qiu
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Jiayi Liu
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Shihui Zhou
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Kun-Yu Lin
The University of Hong Kong
Computer Vision · Machine Learning
Junwei Liang
Assistant Professor, HKUST (Guangzhou) | CSE, HKUST | Ph.D. @CMU
Computer Vision · Robotics · Embodied AI · Trajectory Prediction