🤖 AI Summary
Pre-trained visual representations (PVRs) encode a wealth of task-irrelevant scene information, which undermines the robustness of vision-based motor policies under out-of-distribution visual variations and distractors. To address this, we propose a lightweight, learnable attention-based feature aggregation mechanism that dynamically focuses policy learning on task-relevant visual cues while suppressing even semantically rich distractors—without fine-tuning the backbone network or relying on data augmentation. Our method operates on deep features from large-scale pre-trained models and integrates attention-weighted pooling into end-to-end policy learning. We validate its effectiveness in both simulation and real-world robotic environments. Experiments demonstrate that, compared to standard pooling, our approach significantly improves policy robustness against visual perturbations—including illumination changes, occlusions, and cluttered backgrounds—as well as cross-environment generalization performance.
📝 Abstract
The adoption of pre-trained visual representations (PVRs), leveraging features from large-scale vision models, has become a popular paradigm for training visuomotor policies. However, these powerful representations can encode a broad range of task-irrelevant scene information, making the resulting trained policies vulnerable to out-of-domain visual changes and distractors. In this work we address visuomotor policy feature pooling as a solution to the observed lack of robustness in perturbed scenes. We achieve this via Attentive Feature Aggregation (AFA), a lightweight, trainable pooling mechanism that learns to naturally attend to task-relevant visual cues, ignoring even semantically rich scene distractors. Through extensive experiments in both simulation and the real world, we demonstrate that policies trained with AFA significantly outperform standard pooling approaches in the presence of visual perturbations, without requiring expensive dataset augmentation or fine-tuning of the PVR. Our findings show that ignoring extraneous visual information is a crucial step towards deploying robust and generalisable visuomotor policies. Project Page: tsagkas.github.io/afa
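The core idea of attention-weighted pooling over frozen PVR features can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy, not the paper's exact parameterisation: it uses a single learned scoring vector `w` (trained end-to-end with the policy in practice) to softmax-weight the spatial patch features of a frozen backbone, instead of averaging them uniformly.

```python
import numpy as np

def attentive_feature_aggregation(features, w, b=0.0):
    """Pool N patch features (N, D) into one D-vector via learned attention.

    features: (N, D) spatial features from a frozen pre-trained backbone.
    w: (D,) learnable scoring weights; b: scalar bias.
    (This single-query scoring head is an illustrative assumption.)
    """
    logits = features @ w + b                      # (N,) relevance per patch
    logits = logits - logits.max()                 # numerical stability
    attn = np.exp(logits) / np.exp(logits).sum()   # softmax over patches
    return attn @ features                         # (D,) weighted pooled feature

# toy usage: 4 patches of 3-d features with random scoring weights
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 3))
pooled = attentive_feature_aggregation(feats, rng.standard_normal(3))
```

Because the attention weights form a convex combination, the pooled vector stays in the span of the patch features; only the emphasis on task-relevant patches changes as `w` is trained with the policy loss.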