Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

📅 2025-07-21

📈 Citations: 0

✨ Influential: 0

career value

263K/year

🤖 AI Summary

This study addresses the low learning efficiency and poor robustness of robotic vision systems by proposing a brain-inspired visual framework inspired by human active gaze mechanisms. Methodologically, it integrates eye-tracking data with an active vision system, designs a foveated patch partitioning strategy to dynamically allocate Vision Transformer computational resources toward task-critical regions, and establishes an end-to-end policy jointly optimizing gaze imitation and action prediction. Key contributions include: (1) the first trainable, gaze-guided mechanism for robotic learning; (2) a significant reduction in computational overhead—approximately 40% on the AV-ALOHA platform; (3) improved high-precision manipulation performance, with average task success rate increased by 12.3%; and (4) enhanced robustness against unknown disturbances such as illumination variations and occlusions. Experimental results validate the effectiveness of human-like active vision in improving perception-decision coordination in robotics.

Technology Category

Application Category

📝 Abstract

Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. We build on recent advances in foveated image processing and apply them to an Active Vision robot system that emulates both human head movement and eye tracking. Extending prior work on the AV-ALOHA robot simulation platform, we introduce a framework for simultaneously collecting eye-tracking data and robot demonstrations from a human operator as well as a simulation benchmark and dataset for training robot policies that incorporate human gaze. Given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme inspired by recent work in image segmentation. Compared to uniform patch tokenization, this significantly reduces the number of tokens-and thus computation-without sacrificing visual fidelity near regions of interest. We also explore two approaches to gaze imitation and prediction from human data. The first is a two-stage model that predicts gaze to guide foveation and action; the second integrates gaze into the action space, allowing the policy to jointly predict gaze and actions end-to-end. Our results show that our method for foveated robot vision not only drastically reduces computational overhead, but also improves performance for high precision tasks and robustness to unseen distractors. Together, these findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems. https://ian-chuang.github.io/gaze-av-aloha/

Problem

Research questions and friction points this paper is trying to address.

Enhancing robot learning efficiency via human-like gaze

Integrating foveated vision transformers to reduce computation

Improving task performance and robustness using gaze imitation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Foveated Vision Transformers for efficient processing

Human gaze integration into robot policies

Joint gaze and action prediction end-to-end

🔎 Similar Papers

VIP: Vision Instructed Pre-training for Robotic Manipulation