SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

2026-05-19
Citations: 0
Influential: 0
AI Summary
Existing approaches to humanoid whole-body loco-manipulation rely on task-specific rewards, reference motion replay, or expensive teleoperation, which hinders real-world generalization. This work proposes a three-stage framework that first automatically extracts motion and contact priors from unstructured human videos, then refines them into high-fidelity skills via physics-based optimization, and finally distills those skills into a hierarchical autonomous policy comprising a command generator and a command tracker, enabling deployment without task rewards or reference motions. To our knowledge, this is the first method capable of generating generalizable, closed-loop robotic skills solely from in-the-wild human videos, supporting zero-shot real-world transfer and autonomous recovery. Experiments on six tasks show that the method substantially outperforms reference-tracking baselines and that performance scales with the volume of video data; real-robot validation confirms robustness to disturbances, long-horizon stability, and self-recovery from failures.
๐Ÿ“ Abstract
Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar-humanoid/
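The third stage described in the abstract, a hierarchical policy in which a command generator feeds a command tracker, can be illustrated with a minimal closed-loop sketch. This is not the authors' implementation: every name here (`HierarchicalPolicy`, `rollout`, the `toy_*` functions) is hypothetical, and the toy dynamics merely stand in for a simulator or robot.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class HierarchicalPolicy:
    """Hypothetical two-level policy: a high-level generator proposes a
    command, and a low-level tracker turns that command into an action."""
    command_generator: Callable[[Vector], Vector]         # observation -> command
    command_tracker: Callable[[Vector, Vector], Vector]   # (observation, command) -> action

    def act(self, observation: Vector) -> Vector:
        # The command is recomputed from the current observation at every
        # step, so no pre-recorded reference motion is needed at inference.
        command = self.command_generator(observation)
        return self.command_tracker(observation, command)


def rollout(policy: HierarchicalPolicy,
            step: Callable[[Vector, Vector], Vector],
            observation: Vector,
            num_steps: int) -> List[Vector]:
    """Run the policy in closed loop and return the visited observations."""
    trajectory = [observation]
    for _ in range(num_steps):
        action = policy.act(observation)
        observation = step(observation, action)
        trajectory.append(observation)
    return trajectory


# Toy stand-ins (assumptions for illustration only).
def toy_generator(obs: Vector) -> Vector:
    return [0.0 for _ in obs]  # command: drive every coordinate to zero


def toy_tracker(obs: Vector, cmd: Vector) -> Vector:
    return [0.5 * (c - o) for o, c in zip(obs, cmd)]  # proportional tracking


def toy_step(obs: Vector, action: Vector) -> Vector:
    return [o + a for o, a in zip(obs, action)]  # trivial integrator dynamics


policy = HierarchicalPolicy(toy_generator, toy_tracker)
trajectory = rollout(policy, toy_step, [1.0, -2.0], num_steps=3)
```

Because the generator is queried again at each step, a perturbed state simply yields a fresh command, which is one plausible reading of how closed-loop execution could support recovery; the paper itself should be consulted for the actual mechanism.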
Problem

Research questions and friction points this paper is trying to address.

humanoid robots
loco-manipulation
human videos
generalization
scalable learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

humanoid locomotion
video-driven learning
motion prior refinement
zero-shot transfer
hierarchical policy distillation
Tianshu Wu
CFCS, School of Computer Science, Peking University
Xiangqi Kong
School of Computer Science and Engineering, Beihang University
Yue Chen
Peking University
Robotics, Large Language Model
Qize Yu
CFCS, School of Computer Science, Peking University
Hang Ye
CFCS, School of Computer Science, Peking University
Jia Li
Peking University
Intelligent Software Engineering, Natural Language Processing
Yizhou Wang
CFCS, School of Computer Science, Peking University
Hao Dong
Tenured Associate Professor at Peking University
Embodied AI, Robotics, 3D Vision, Robot Learning, Reinforcement Learning