🤖 AI Summary
Pre-trained visual representations (PVRs) encode a wealth of task-irrelevant scene information, which undermines the robustness of vision-based motor policies under out-of-distribution visual variations and distractors. To address this, we propose a lightweight, learnable attention-based feature aggregation mechanism that dynamically focuses policy learning on task-relevant visual cues while suppressing even semantically rich distractors—without fine-tuning the backbone network or relying on data augmentation. Our method operates on deep features from large-scale pre-trained models and integrates attention-weighted pooling into end-to-end policy learning. We validate its effectiveness in both simulation and real-world robotic environments. Experiments demonstrate that, compared to standard pooling, our approach significantly improves policy robustness against visual perturbations—including illumination changes, occlusions, and cluttered backgrounds—as well as cross-environment generalization performance.
📝 Abstract
The adoption of pre-trained visual representations (PVRs), leveraging features from large-scale vision models, has become a popular paradigm for training visuomotor policies. However, these powerful representations can encode a broad range of task-irrelevant scene information, making the resulting trained policies vulnerable to out-of-domain visual changes and distractors. In this work we address visuomotor policy feature pooling as a solution to the observed lack of robustness in perturbed scenes. We achieve this via Attentive Feature Aggregation (AFA), a lightweight, trainable pooling mechanism that learns to naturally attend to task-relevant visual cues, ignoring even semantically rich scene distractors. Through extensive experiments in both simulation and the real world, we demonstrate that policies trained with AFA significantly outperform standard pooling approaches in the presence of visual perturbations, without requiring expensive dataset augmentation or fine-tuning of the PVR. Our findings show that ignoring extraneous visual information is a crucial step towards deploying robust and generalisable visuomotor policies. Project Page: tsagkas.github.io/afa
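The core idea of attention-weighted pooling over frozen PVR features can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy, not the paper's exact parameterisation: it uses a single learned scoring vector `w` (trained end-to-end with the policy in practice) to softmax-weight the spatial patch features of a frozen backbone, instead of averaging them uniformly.

```python
import numpy as np

def attentive_feature_aggregation(features, w, b=0.0):
    """Pool N patch features (N, D) into one D-vector via learned attention.

    features: (N, D) spatial features from a frozen pre-trained backbone.
    w: (D,) learnable scoring weights; b: scalar bias.
    (This single-query scoring head is an illustrative assumption.)
    """
    logits = features @ w + b                      # (N,) relevance per patch
    logits = logits - logits.max()                 # numerical stability
    attn = np.exp(logits) / np.exp(logits).sum()   # softmax over patches
    return attn @ features                         # (D,) weighted pooled feature

# toy usage: 4 patches of 3-d features with random scoring weights
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 3))
pooled = attentive_feature_aggregation(feats, rng.standard_normal(3))
```

Because the attention weights form a convex combination, the pooled vector stays in the span of the patch features; only the emphasis on task-relevant patches changes as `w` is trained with the policy loss.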