AI Summary
This work addresses vision-driven embodied intelligence for humanoid robots, focusing on multi-task manipulation, including object search, grasping, placing, and joint-constrained interactions, using only egocentric visual input and without privileged information (e.g., 3D object poses or geometric priors). We propose the "Perception-as-Interface" paradigm, integrating end-to-end vision-based reinforcement learning, whole-body dynamics modeling, self-supervised visual representation learning, and sparse-reward optimization. Our approach is the first to elicit human-like behaviors, such as active visual search and hand-eye coordination, naturally within purely vision-based RL, enabling single-policy generalization across diverse tasks. Experiments demonstrate emergent synergy between active perception and dexterous manipulation under closed-loop visual control in complex household task sequences. The method significantly improves cross-task generalization and biological plausibility compared to prior approaches.
Abstract
Human behavior is fundamentally shaped by visual perception -- our ability to interact with the world depends on actively gathering relevant information and adapting our movements accordingly. Behaviors like searching for objects, reaching, and hand-eye coordination naturally emerge from the structure of our sensory system. Inspired by these principles, we introduce Perceptive Dexterous Control (PDC), a framework for vision-driven dexterous whole-body control with simulated humanoids. PDC operates solely on egocentric vision for task specification, enabling object search, target placement, and skill selection through visual cues, without relying on privileged state information (e.g., 3D object positions and geometries). This perception-as-interface paradigm enables learning a single policy to perform multiple household tasks, including reaching, grasping, placing, and articulated object manipulation. We also show that training from scratch with reinforcement learning can produce emergent behaviors such as active search. These results demonstrate how vision-driven control and complex tasks induce human-like behaviors and can serve as the key ingredients in closing the perception-action loop for animation, robotics, and embodied AI.
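To make the perception-as-interface idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of a policy whose observations contain only an egocentric camera frame and proprioception; object poses and geometry never enter the input, so what to grasp and where to place it must be read off the pixels. All names, dimensions, and the network architecture here are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class EgocentricPolicy(nn.Module):
    """Illustrative vision-only policy: egocentric RGB + proprioception in,
    whole-body action targets out. No privileged object state is consumed."""

    def __init__(self, img_channels: int = 3, proprio_dim: int = 69, action_dim: int = 51):
        super().__init__()
        # Small CNN encoder for the egocentric view (architecture is an assumption).
        self.encoder = nn.Sequential(
            nn.Conv2d(img_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # MLP head mapping visual features + proprioception to actions.
        self.head = nn.Sequential(
            nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, egocentric_rgb: torch.Tensor, proprioception: torch.Tensor) -> torch.Tensor:
        # egocentric_rgb: (B, 3, H, W); proprioception: (B, proprio_dim)
        visual = self.encoder(egocentric_rgb)
        return self.head(torch.cat([visual, proprioception], dim=-1))

# Usage: the observation deliberately excludes object state, so task
# specification (which object, where to place it) comes from vision alone.
policy = EgocentricPolicy()
frame = torch.zeros(1, 3, 64, 64)   # egocentric camera frame
proprio = torch.zeros(1, 69)        # joint positions/velocities (dims assumed)
action = policy(frame, proprio)     # whole-body control targets, shape (1, 51)
```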