SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to humanoid whole-body loco-manipulation rely on task-specific rewards, reference-motion replay, or expensive teleoperation, all of which hinder real-world generalization. This work proposes a three-stage framework that first automatically extracts motion and contact priors from unstructured human videos, then refines them into high-fidelity skills via physics-based optimization, and finally distills them into a hierarchical autonomous policy comprising a command generator and a command tracker, so that deployment requires neither task rewards nor reference motions. To our knowledge, this is the first method capable of generating generalizable, closed-loop robotic skills solely from in-the-wild human videos, supporting zero-shot real-world transfer and autonomous recovery. Experiments on six tasks show that it substantially outperforms reference-tracking baselines and that performance scales with the volume of video data; real-robot validation confirms robustness to disturbances, long-horizon stability, and self-recovery from failures.

📝 Abstract

Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors, including human-object motion trajectories and contact labels, from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and a progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and on real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar-humanoid/
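
The third stage describes a hierarchical policy in which a command generator feeds a command tracker, with no reference motion supplied at inference. The toy sketch below illustrates only that closed-loop structure; it is not SUGAR's implementation. Every name in it (`HierarchicalPolicy`, `toy_generator`, `toy_tracker`, `rollout`, `GOAL`) is hypothetical, the "robot" is a 2D point that moves by the commanded action, and the hand-written generator and tracker stand in for what would be learned networks. The point it tries to make is that, because the command is recomputed from the current observation at every step, a mid-rollout push is absorbed rather than derailing a fixed trajectory.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]

# Hypothetical task goal for the toy example (not from the paper).
GOAL: Vector = [1.0, 0.0]


@dataclass
class HierarchicalPolicy:
    """Two-level policy: the generator proposes a command, the tracker follows it."""
    command_generator: Callable[[Sequence[float]], Vector]
    command_tracker: Callable[[Sequence[float], Sequence[float]], Vector]

    def act(self, observation: Sequence[float]) -> Vector:
        # The command depends only on the current observation,
        # not on a pre-recorded reference motion.
        command = self.command_generator(observation)
        return self.command_tracker(observation, command)


def toy_generator(observation: Sequence[float]) -> Vector:
    # Assumed stand-in for a learned generator: command a target state
    # at most 0.2 closer to the goal along each axis.
    return [o + max(-0.2, min(0.2, g - o)) for o, g in zip(observation, GOAL)]


def toy_tracker(observation: Sequence[float], command: Sequence[float]) -> Vector:
    # Assumed stand-in for a learned tracker: proportional control
    # toward the commanded target.
    return [0.8 * (c - o) for o, c in zip(observation, command)]


def rollout(policy, state, steps, push_at=None, push=(0.0, 0.0)) -> Vector:
    """Run the policy in closed loop; optionally perturb the state once."""
    state = list(state)
    for t in range(steps):
        if t == push_at:
            # External disturbance applied directly to the state.
            state = [s + p for s, p in zip(state, push)]
        action = policy.act(state)
        # Toy dynamics: the state moves by exactly the action.
        state = [s + a for s, a in zip(state, action)]
    return state


if __name__ == "__main__":
    policy = HierarchicalPolicy(toy_generator, toy_tracker)
    final = rollout(policy, [0.0, 0.0], steps=60, push_at=10, push=(-0.5, 0.3))
    print("final state:", final)
```

In this toy setting the state still ends up at `GOAL` after the push at step 10, since each new command is generated from wherever the state actually is. An open-loop replay of a stored trajectory would carry the offset forward instead.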

Problem

Research questions and friction points this paper is trying to address.

humanoid robots

loco-manipulation

human videos

generalization

scalable learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

humanoid loco-manipulation

video-driven learning

motion prior refinement