AutoFocus-IL: VLM-based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations

📅 2025-11-23
🤖 AI Summary
To address low data efficiency, poor generalization, and susceptibility to confounding factors and spurious correlations in visual imitation learning, this paper proposes a human-annotation-free attention regularization method. It leverages vision-language models (VLMs) to automatically generate temporal saliency maps, guiding the policy to attend to causally relevant, task-critical regions. This is the first work to employ VLM-driven saliency generation for attention supervision in imitation learning—eliminating reliance on privileged signals such as eye-tracking data. The approach integrates behavioral cloning with saliency-guided attention regularization. Evaluated on both CARLA simulation and real-world robotic manipulation tasks, it consistently outperforms standard behavioral cloning and state-of-the-art methods requiring human supervision. Notably, it achieves a 37% improvement in sample efficiency while simultaneously enhancing policy generalization and robustness.

📝 Abstract
AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. Code, datasets, and trained policy videos are available at https://AutoFocus-IL.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Improving data efficiency in visual imitation learning without human annotations
Guiding policies to focus on task-relevant features using VLM-generated saliency maps
Automatically identifying key objects in demonstrations to suppress visual distractors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vision-language models to identify key objects
Generates temporal saliency maps without human annotations
Regularizes behavior cloning policies with saliency maps
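The regularization idea above can be sketched as a single combined objective: a standard behavior-cloning loss plus a term that pulls the policy's attention map toward the VLM-generated saliency map. This is a minimal illustration under stated assumptions, not the paper's implementation: the MSE action loss, the KL alignment term, and the weight `lam` are choices made for the sketch.

```python
import math

def normalize(x):
    """Normalize a flattened map so it sums to 1 (treat it as a distribution)."""
    s = sum(x)
    return [v / s for v in x]

def kl_div(p, q, eps=1e-8):
    """KL divergence between two normalized maps, with eps for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def combined_loss(action_pred, action_true, attn_map, saliency_map, lam=0.1):
    """Behavior cloning (MSE on actions) plus saliency-guided attention regularization.

    attn_map: the policy's flattened spatial attention weights.
    saliency_map: the VLM-generated temporal saliency map for the same frame.
    lam: regularization weight (hypothetical value for this sketch).
    """
    bc = sum((a - b) ** 2 for a, b in zip(action_pred, action_true)) / len(action_true)
    reg = kl_div(normalize(saliency_map), normalize(attn_map))
    return bc + lam * reg
```

When the policy's attention already matches the saliency map, the KL term vanishes and the objective reduces to plain behavior cloning; attention mass placed on regions the VLM marked as irrelevant is penalized.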
Litian Gong
Department of Electrical and Computer Engineering, University of California, Riverside, USA
Fatemeh Bahrani
Thomas Lord Department of Computer Science, University of Southern California, USA
Yutai Zhou
Thomas Lord Department of Computer Science, University of Southern California, USA
Amin Banayeeanzade
Graduate Research Assistant, University of Southern California
Artificial General Intelligence · Machine Learning · Continual Learning
Jiachen Li
Department of Electrical and Computer Engineering, University of California, Riverside, USA
Erdem Bıyık
Assistant Professor, University of Southern California
Robotics · Human-Robot Interaction · Machine Learning · Artificial Intelligence