Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning

📅 2025-10-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Vision-based reinforcement learning suffers from low sample efficiency and training instability due to high-dimensional image inputs containing abundant task-irrelevant pixels. This paper proposes reward-guided foveal attention, a novel mechanism that constructs contrastive triplets from return disparities between successful and failed trajectories, enabling self-supervised contrastive learning to automatically steer visual attention toward task-critical regions, without modifying the underlying RL algorithm. The key innovation lies in transforming return differences into differentiable attention supervision signals, facilitating end-to-end learning of visual feature selection. Evaluated on the ManiSkill3 manipulation benchmark, our method improves sample efficiency by up to 2.4× over baselines and, for the first time, achieves stable convergence across multiple complex tasks under standard training protocols.

๐Ÿ“ Abstract
Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent's experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.4x improvement in sample efficiency and can solve tasks that the baseline fails to learn, demonstrated across a suite of manipulation tasks from the ManiSkill3 benchmark, all without modifying the underlying algorithm or hyperparameters.
Problem

Research questions and friction points this paper is trying to address.

Visual RL agents waste resources on irrelevant image features
Sample-inefficient learning due to attention on unimportant pixels
Need to identify task-relevant visual features automatically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable foveal attention mechanism guided by returns
Return-guided contrastive learning for feature distinction
Grouping visual representations by return differences into triplets
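The triplet construction outlined above can be sketched in a few lines. This is an illustrative reconstruction from the abstract, not the paper's released code: the function names (`build_triplets`, `triplet_loss`), the return-difference `threshold`, and the Euclidean triplet margin loss are all assumptions standing in for the paper's actual grouping rule and objective.

```python
import numpy as np

def build_triplets(returns, threshold=0.5):
    """Form (anchor, positive, negative) index triplets from returns.

    Illustrative assumption: two states are a positive pair when their
    returns differ by less than `threshold`; a negative is any state
    whose return differs from the anchor's by at least `threshold`.
    """
    triplets = []
    n = len(returns)
    for a in range(n):
        for p in range(n):
            if p == a or abs(returns[a] - returns[p]) >= threshold:
                continue
            for g in range(n):
                if abs(returns[a] - returns[g]) < threshold:
                    continue
                triplets.append((a, p, g))
    return triplets

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on (attention-weighted) features."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```

In the paper's framework this loss would be backpropagated through the attention mechanism, so that states with different outcomes are pushed toward distinguishable attended representations; the sketch here only shows the triplet bookkeeping.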
Andrew Lee
Department of Computer Science, University of California, Davis

Ian T. Chuang
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley

Dechen Gao
University of California, Davis

Kai Fukazawa
Department of Mechanical and Aerospace Engineering, University of California, Davis

Iman Soltani
Assistant Professor of Mechanical and Aerospace Engineering, University of California, Davis