Mind the GAP: Glimpse-based Active Perception improves generalization and sample efficiency of visual reasoning

📈 Citations: 0
Influential: 0
🤖 AI Summary
Humans effortlessly discern visual relationships (e.g., same/different) between unseen objects, whereas current AI systems suffer from poor generalization and low sample efficiency. To address this, we propose the Glimpse-based Active Perception (GAP) framework, a novel active perception paradigm grounded in sequential glimpses. GAP explicitly incorporates low-dimensional, eye-movement-inspired spatial signals into visual relationship modeling via three key mechanisms: (i) sequential attention to salient regions, (ii) positional encoding of glimpse locations, and (iii) multi-stage fusion of high-resolution local features, enabling a principled transition from pixel-level to structural-level relational understanding. Glimpse selection is supported by either reinforcement learning or differentiable attention, and GAP integrates an explicit relational reasoning module. Evaluated on multiple visual reasoning benchmarks, GAP achieves state-of-the-art performance, with substantial improvements in both sample efficiency and out-of-distribution generalization.
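To make the glimpse-and-fuse loop concrete, below is a minimal PyTorch-style sketch of how glimpse content ("what") could be combined with glimpse locations ("where") and passed to a relational reasoning head. All module names, dimensions, and the choice of a small transformer encoder as the reasoning module are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlimpseRelationalModel(nn.Module):
    """Illustrative sketch: encode a sequence of high-resolution glimpses
    together with their (x, y) locations and reason over them jointly.
    Hyperparameters and module choices are assumptions, not the paper's."""

    def __init__(self, glimpse_size=32, embed_dim=128, num_glimpses=4, num_classes=2):
        super().__init__()
        self.glimpse_size = glimpse_size
        self.num_glimpses = num_glimpses
        # Small CNN encoder for each high-resolution glimpse patch.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Low-dimensional spatial signal: embed the normalized glimpse location.
        self.loc_embed = nn.Linear(2, embed_dim)
        # Explicit relational reasoning over the glimpse sequence
        # (a tiny transformer encoder stands in for the reasoning module).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.reasoner = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def crop_glimpse(self, image, loc):
        """Extract a high-resolution crop centered at a normalized location in [-1, 1]."""
        b, _, h, w = image.shape
        g = self.glimpse_size
        # Build a sampling grid around each location and crop via grid_sample.
        ys = torch.linspace(-1, 1, g, device=image.device) * (g / h)
        xs = torch.linspace(-1, 1, g, device=image.device) * (g / w)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((grid_x, grid_y), dim=-1)              # (g, g, 2)
        grid = grid.unsqueeze(0) + loc.view(b, 1, 1, 2)           # shift grid to the glimpse location
        return F.grid_sample(image, grid, align_corners=False)

    def forward(self, image, locations):
        """image: (B, 3, H, W); locations: (B, num_glimpses, 2) in [-1, 1]."""
        tokens = []
        for t in range(self.num_glimpses):
            loc = locations[:, t]                        # (B, 2)
            patch = self.crop_glimpse(image, loc)        # (B, 3, g, g)
            # Fuse "what" (patch content) with "where" (glimpse location).
            tokens.append(self.encoder(patch) + self.loc_embed(loc))
        tokens = torch.stack(tokens, dim=1)              # (B, num_glimpses, embed_dim)
        relational = self.reasoner(tokens).mean(dim=1)   # pool over glimpses
        return self.classifier(relational)               # e.g., same vs. different
```

In a full system, the glimpse locations would come from a learned policy or attention mechanism rather than being passed in; the fusion of content embeddings with location embeddings is the part that mirrors the paper's use of low-dimensional spatial signals for relational reasoning.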

📝 Abstract
Human capabilities in understanding visual relations are far superior to those of AI systems, especially for previously unseen objects. For example, while AI systems struggle to determine whether two such objects are visually the same or different, humans can do so with ease. Active vision theories postulate that the learning of visual relations is grounded in actions that we take to fixate objects and their parts by moving our eyes. In particular, the low-dimensional spatial information about the corresponding eye movements is hypothesized to facilitate the representation of relations between different image parts. Inspired by these theories, we develop a system equipped with a novel Glimpse-based Active Perception (GAP) that sequentially glimpses at the most salient regions of the input image and processes them at high resolution. Importantly, our system leverages the locations stemming from the glimpsing actions, along with the visual content around them, to represent relations between different parts of the image. The results suggest that the GAP is essential for extracting visual relations that go beyond the immediate visual content. Our approach reaches state-of-the-art performance on several visual reasoning tasks, while being more sample-efficient and generalizing better to out-of-distribution visual inputs than prior models.
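The abstract states that glimpses target the most salient regions of the image and that selection can be made differentiable. One common way to obtain a differentiable glimpse location is a soft-argmax over a predicted saliency map; the sketch below illustrates that idea under stated assumptions. The saliency network, temperature value, and overall structure are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftArgmaxGlimpseSelector(nn.Module):
    """Illustrative sketch: predict a saliency map and convert it into a
    differentiable glimpse location via a soft-argmax (expected coordinate).
    The architecture and temperature are assumptions."""

    def __init__(self, temperature=0.1):
        super().__init__()
        self.temperature = temperature
        # Tiny convolutional network producing a single-channel saliency map.
        self.saliency_net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, image):
        b, _, h, w = image.shape
        saliency = self.saliency_net(image).view(b, -1)          # (B, H*W)
        attn = F.softmax(saliency / self.temperature, dim=-1)    # soft attention over pixels
        attn = attn.view(b, h, w)
        # Expected (x, y) coordinate under the attention map, in [-1, 1].
        ys = torch.linspace(-1, 1, h, device=image.device)
        xs = torch.linspace(-1, 1, w, device=image.device)
        y = (attn.sum(dim=2) * ys).sum(dim=1)   # marginal over rows
        x = (attn.sum(dim=1) * xs).sum(dim=1)   # marginal over columns
        return torch.stack((x, y), dim=-1)      # (B, 2) differentiable glimpse location
```

A selector like this could be invoked once per glimpse step and its output fed to a crop-and-fuse loop such as the one sketched above. A reinforcement-learning variant, as also mentioned in the summary, would instead sample discrete locations and train the selection policy from a task reward.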
Problem

Research questions and friction points this paper is trying to address.

Current AI systems generalize poorly when judging visual relations (e.g., same/different) between previously unseen objects
Learning visual relations from raw pixels alone is sample-inefficient
Passive, single-pass perception struggles to extract relations that go beyond the immediate visual content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Glimpse-based Active Perception (GAP): sequential high-resolution glimpses at the most salient image regions
Leverages glimpse locations together with the surrounding visual content to represent relations between image parts
Reaches state-of-the-art results on several visual reasoning tasks with better sample efficiency and out-of-distribution generalization
Authors
Oleh Kolner
IBM Research Europe - Zurich
Thomas Ortner
IBM Research Europe - Zurich
Stanisław Woźniak
IBM Research Europe - Zurich
Angeliki Pantazi
Principal Research Staff Member, IBM Research - Zurich
Neuromorphic Computing