AI Summary
This work addresses vision-driven embodied intelligence for humanoid robots, focusing on multi-task manipulation, including object search, grasping, placing, and joint-constrained interactions, using only egocentric visual input and without privileged information (e.g., 3D object poses or geometric priors). We propose the "Perception-as-Interface" paradigm, integrating end-to-end vision-based reinforcement learning, whole-body dynamics modeling, self-supervised visual representation learning, and sparse-reward optimization. Our approach is the first to elicit human-like behaviors, such as active visual search and hand-eye coordination, naturally within purely vision-based RL, enabling single-policy generalization across diverse tasks. Experiments demonstrate emergent synergy between active perception and dexterous manipulation under closed-loop visual control in complex household task sequences. The method significantly improves cross-task generalization and biological plausibility compared to prior approaches.
Abstract
Human behavior is fundamentally shaped by visual perception -- our ability to interact with the world depends on actively gathering relevant information and adapting our movements accordingly. Behaviors like searching for objects, reaching, and hand-eye coordination naturally emerge from the structure of our sensory system. Inspired by these principles, we introduce Perceptive Dexterous Control (PDC), a framework for vision-driven dexterous whole-body control with simulated humanoids. PDC operates solely on egocentric vision for task specification, enabling object search, target placement, and skill selection through visual cues, without relying on privileged state information (e.g., 3D object positions and geometries). This perception-as-interface paradigm enables learning a single policy to perform multiple household tasks, including reaching, grasping, placing, and articulated object manipulation. We also show that training from scratch with reinforcement learning can produce emergent behaviors such as active search. These results demonstrate how vision-driven control and complex tasks induce human-like behaviors and can serve as the key ingredients in closing the perception-action loop for animation, robotics, and embodied AI.
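To make the perception-as-interface idea concrete, below is a minimal, illustrative sketch (not the authors' implementation) of a policy whose observations contain only an egocentric camera frame and proprioception; object poses and geometry never enter the input, so what to grasp and where to place it must be read off the pixels. All names, dimensions, and the network architecture here are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class EgocentricPolicy(nn.Module):
    """Illustrative vision-only policy: egocentric RGB + proprioception in,
    whole-body action targets out. No privileged object state is consumed."""

    def __init__(self, img_channels: int = 3, proprio_dim: int = 69, action_dim: int = 51):
        super().__init__()
        # Small CNN encoder for the egocentric view (architecture is an assumption).
        self.encoder = nn.Sequential(
            nn.Conv2d(img_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # MLP head mapping visual features + proprioception to actions.
        self.head = nn.Sequential(
            nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, egocentric_rgb: torch.Tensor, proprioception: torch.Tensor) -> torch.Tensor:
        # egocentric_rgb: (B, 3, H, W); proprioception: (B, proprio_dim)
        visual = self.encoder(egocentric_rgb)
        return self.head(torch.cat([visual, proprioception], dim=-1))

# Usage: the observation deliberately excludes object state, so task
# specification (which object, where to place it) comes from vision alone.
policy = EgocentricPolicy()
frame = torch.zeros(1, 3, 64, 64)   # egocentric camera frame
proprio = torch.zeros(1, 69)        # joint positions/velocities (dims assumed)
action = policy(frame, proprio)     # whole-body control targets, shape (1, 51)
```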