🤖 AI Summary
This work addresses the problem of passive, task-agnostic visual gaze in robotic hand-eye coordination. Methodologically, we propose a biomimetic active gaze framework comprising: (i) a rotatable mechanical eyeball hardware system integrating 360° panoramic capture with rendering-based viewpoint synthesis; (ii) a foveal attention encoding network enabling high-resolution, stable fixation at low computational cost; and (iii) a joint behavioral cloning (BC) and reinforcement learning (RL) training paradigm for closed-loop perception–action control, in which the eye learns to move autonomously toward task-critical hand regions. Experiments demonstrate: (i) significantly improved success rates across five panoramic-workspace manipulation tasks; (ii) hand–eye coordination across a 180° workspace using a single camera; and (iii) emergent, human-like gaze behaviors, including target locking, motion prediction, and distractor suppression, without explicit gaze supervision.
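The foveal attention encoding in (ii) can be pictured as a multi-scale pyramid centered on the gaze point: concentric crops that widen the field of view at each level but are all downsampled to the same small resolution, so the fovea stays sharp while the periphery stays cheap. A minimal NumPy sketch under that assumption — the crop scales, level count, and output size here are illustrative stand-ins, not the paper's actual architecture:

```python
import numpy as np

def foveal_encode(image, center, out_size=32, levels=3):
    """Stack concentric crops around the gaze center: each level doubles
    the field of view but is downsampled to the same output resolution,
    giving high acuity at the fovea and coarse context in the periphery.
    (Illustrative scales and sizes, not the paper's actual parameters.)"""
    h, w, c = image.shape
    cy, cx = center
    stack = []
    for lvl in range(levels):
        half = (out_size // 2) * (2 ** lvl)   # half-width of this level's crop
        y0, y1, x0, x1 = cy - half, cy + half, cx - half, cx + half
        # zero-pad where the crop extends past the image border
        crop = np.zeros((2 * half, 2 * half, c), dtype=image.dtype)
        sy0, sy1 = max(y0, 0), min(y1, h)
        sx0, sx1 = max(x0, 0), min(x1, w)
        crop[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = image[sy0:sy1, sx0:sx1]
        stack.append(crop[:: 2 ** lvl, :: 2 ** lvl])  # downsample wider levels
    return np.concatenate(stack, axis=-1)  # (out_size, out_size, c * levels)
```

Concatenating the levels along the channel axis keeps the policy input at a fixed, small spatial size regardless of how wide the outermost level looks, which is what keeps the compute budget constant.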
📝 Abstract
Humans do not passively observe the visual world -- we actively look in order to act. Motivated by this principle, we introduce EyeRobot, a robotic system with gaze behavior that emerges from the need to complete real-world tasks. We develop a mechanical eyeball that can freely rotate to observe its surroundings and train a gaze policy to control it using reinforcement learning. We accomplish this by first collecting teleoperated demonstrations paired with a 360° camera. This data is imported into a simulation environment that supports rendering arbitrary eyeball viewpoints, allowing episode rollouts of eye gaze on top of the robot demonstrations. We then introduce a BC-RL loop to train the hand and eye jointly: the hand (BC) agent is trained from rendered eye observations, and the eye (RL) agent is rewarded when the hand produces correct action predictions. In this way, hand-eye coordination emerges as the eye learns to look toward regions that allow the hand to complete the task. EyeRobot implements a foveal-inspired policy architecture that achieves high resolution on a small compute budget, which we find also leads to more stable fixation and an improved ability to track objects and ignore distractors. We evaluate EyeRobot on five tasks requiring manipulation across a panoramic arc surrounding the robot arm. Our experiments suggest EyeRobot exhibits hand-eye coordination behaviors that effectively facilitate manipulation over large workspaces with a single camera. See the project site for videos: https://www.eyerobot.net/
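The BC-RL loop described above can be sketched in toy form: a softmax "eye" policy over discrete gaze directions, a linear "hand" regressor trained by behavioral cloning on the rendered view, and an eye reward equal to the hand's (negative) prediction error, optimized with a REINFORCE-style update. Every component here (gaze bins, the stand-in renderer, the linear hand, the fixed target) is an illustrative simplification, not the paper's architecture; it only shows how coordination can emerge when the eye is rewarded for views that let the hand imitate the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (not the paper's models): K discrete gaze directions, a
# fixed target bin, a linear "hand" trained by BC, and a REINFORCE-style
# "eye" policy rewarded by the hand's action-prediction accuracy.
K, TARGET, EXPERT_ACTION = 8, 3, 3.0
eye_logits = np.zeros(K)        # eye (RL) policy parameters
W = np.zeros(2)                 # hand (BC) policy parameters
baseline = 0.0                  # running reward baseline for REINFORCE

def render(gaze_bin):
    """Stand-in for rendering an eyeball viewpoint from the 360° capture:
    the view is informative only when the gaze points at the target."""
    return np.array([float(TARGET), 1.0]) if gaze_bin == TARGET else np.zeros(2)

for step in range(3000):
    # Eye (RL): sample a gaze direction from a softmax policy
    probs = np.exp(eye_logits - eye_logits.max())
    probs /= probs.sum()
    gaze = rng.choice(K, p=probs)

    # Hand (BC): predict the demonstrated action from the rendered view
    obs = render(gaze)
    err = W @ obs - EXPERT_ACTION
    W -= 0.05 * 2 * err * obs            # gradient step on squared BC error

    # Eye reward = hand accuracy (negative BC loss); REINFORCE update
    reward = -err ** 2
    advantage = reward - baseline
    baseline += 0.1 * (reward - baseline)
    grad = -probs
    grad[gaze] += 1.0                    # gradient of log pi(gaze)
    eye_logits += 0.01 * advantage * grad
```

After training, the eye policy concentrates its probability mass on the gaze bin that reveals the target, even though it was never told where to look: the only learning signal it ever receives is how well the hand imitates the demonstration from the resulting view.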