🤖 AI Summary
Humanoid robots lack autonomous loco-manipulation capabilities, hindering real-world deployment. This paper proposes VIRAL, a zero-shot, vision-driven sim-to-real transfer framework built on a teacher–student architecture: a privileged teacher policy is trained via reinforcement learning in large-scale simulation with tiled rendering, and a vision-based student policy is distilled from it online using a mixture of DAgger and behavior cloning, augmented with visual domain randomization, sensor-latency modeling, and real-to-sim alignment of the hands and cameras. To the authors' knowledge, this is the first method enabling long-horizon autonomous loco-manipulation from monocular RGB input alone, without any real-world fine-tuning. Evaluated on the Unitree G1 humanoid, it sustains up to 54 consecutive task cycles across visually and structurally diverse environments, demonstrating strong generalization and performance approaching expert teleoperation.
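The DAgger/behavior-cloning mixture used for student distillation can be sketched as below. This is a minimal illustration, not the paper's implementation: all names (`collect_batch`, `env.reset`, `env.step`, the policies) are hypothetical, and the key idea shown is only who drives the rollout, since the privileged teacher always supplies the supervision target.

```python
import random

def collect_batch(env, teacher, student, p_dagger=0.5, horizon=8):
    """Collect (observation, teacher_action) pairs for distillation.

    With probability p_dagger the STUDENT drives the rollout (DAgger:
    the teacher relabels states the student actually visits); otherwise
    the TEACHER drives it, yielding plain behavior-cloning data.
    Hypothetical sketch, not the paper's API.
    """
    batch = []
    obs = env.reset()
    use_student = random.random() < p_dagger  # pick rollout policy per episode
    for _ in range(horizon):
        label = teacher(obs)                   # privileged teacher gives the target action
        act = student(obs) if use_student else label
        batch.append((obs, label))             # always supervise with the teacher's action
        obs = env.step(act)
    return batch
```

In practice the student network would then be updated by regressing its outputs onto the collected teacher labels; annealing `p_dagger` from 0 toward 1 recovers the usual BC-to-DAgger schedule.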
📝 Abstract
A key barrier to the real-world deployment of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization (over lighting, materials, camera parameters, image quality, and sensor delays) with real-to-sim alignment of the dexterous hands and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice.
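The per-episode visual domain randomization described in the abstract can be sketched as a sampled rendering/sensing configuration. The parameter names and numeric ranges below are placeholders chosen for illustration, not the paper's actual values; they only mirror the categories listed (lighting, materials, camera parameters, image quality, sensor delays).

```python
import random

def sample_visual_randomization(rng=random):
    """Sample one randomized rendering/sensing config per simulation episode.

    Hypothetical ranges: the goal is that the student never sees the same
    lighting, materials, camera, or latency twice, so the real camera at
    deployment looks like just another sample from this distribution.
    """
    return {
        "light_intensity": rng.uniform(0.3, 2.0),                      # lighting
        "light_color_rgb": [rng.uniform(0.8, 1.0) for _ in range(3)],
        "material_albedo": [rng.uniform(0.1, 0.9) for _ in range(3)],  # materials
        "camera_fov_deg": rng.uniform(55.0, 75.0),                     # camera parameters
        "camera_pos_jitter_m": [rng.gauss(0.0, 0.01) for _ in range(3)],
        "jpeg_quality": rng.randint(40, 95),                           # image quality
        "image_noise_std": rng.uniform(0.0, 0.02),
        "sensor_delay_steps": rng.randint(0, 3),                       # sensor latency
    }
```

A new configuration would typically be drawn at every environment reset, so the distribution is randomized across the thousands of parallel tiled-rendering environments rather than within a single rollout.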