From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-augmentation methods (e.g., random cropping) improve zero-shot performance of vision-language models (VLMs) like CLIP but degrade global semantic understanding by introducing background noise and overemphasizing local details. To address this, we propose **Attention-Based Selection (ABS)**—a training-free, attention-driven selection mechanism that adaptively focuses on salient regions in both pixel and feature spaces. ABS combines attention-guided multi-view cropping with feature-level selection, further enhanced by soft-matching filtering of LLM-generated image descriptions to strengthen cross-modal alignment. Crucially, ABS requires no fine-tuning or parameter updates. Evaluated on out-of-distribution generalization and zero-shot classification benchmarks, it achieves state-of-the-art performance—matching or surpassing few-shot and test-time adaptation approaches—while preserving computational efficiency and model integrity.

📝 Abstract
Pretrained vision-language models (VLMs), e.g., CLIP, demonstrate impressive zero-shot capabilities on downstream tasks. Prior research highlights the crucial role of visual augmentation techniques, like random cropping, in aligning with fine-grained class descriptions generated by large language models (LLMs), significantly enhancing zero-shot performance by incorporating multi-view information. However, the inherent randomness of these augmentations can inevitably introduce background artifacts and cause models to focus overly on local details, compromising global semantic understanding. To address these issues, we propose an **A**ttention-**B**ased **S**election (**ABS**) method from local details to global context, which applies attention-guided cropping in both raw images and feature space and supplements global semantic information through strategic feature selection. Additionally, we introduce a soft-matching technique to effectively filter LLM descriptions for better alignment. **ABS** achieves state-of-the-art performance on out-of-distribution generalization and zero-shot classification tasks. Notably, **ABS** is training-free and even rivals few-shot and test-time adaptation methods. Our code is available at https://github.com/BIT-DA/ABS.
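The attention-guided cropping described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, the fixed crop fraction, and the use of the attention-map argmax to pick the crop center are all assumptions for the sketch.

```python
import numpy as np

def attention_guided_crop(image, attn, crop_frac=0.6):
    """Crop around the attention peak instead of a random window (sketch).

    image: (H, W, C) array; attn: (h, w) attention map, e.g. derived from
    a VLM's visual attention (the exact source of `attn` is an assumption).
    """
    H, W = image.shape[:2]
    h, w = attn.shape
    # Locate the most salient cell in the attention map.
    py, px = np.unravel_index(np.argmax(attn), attn.shape)
    # Map that cell back to pixel space (its center).
    cy, cx = int((py + 0.5) * H / h), int((px + 0.5) * W / w)
    # Clamp the crop window so it stays inside the image.
    ch, cw = int(H * crop_frac), int(W * crop_frac)
    y0 = int(np.clip(cy - ch // 2, 0, H - ch))
    x0 = int(np.clip(cx - cw // 2, 0, W - cw))
    return image[y0:y0 + ch, x0:x0 + cw]
```

Unlike random cropping, every view produced this way is centered on a high-attention region, which is how the method avoids injecting pure-background crops.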
Problem

Research questions and friction points this paper is trying to address.

Addresses randomness in visual augmentation causing background artifacts
Enhances global semantic understanding in vision-language models
Improves alignment of LLM descriptions with visual features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-Based Selection for guided cropping
Soft matching for LLM description filtering
Training-free method with global semantic enhancement
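The soft-matching idea for filtering LLM descriptions can be sketched as below: rather than hard-selecting one description per class, similarities are converted into soft weights. The function name, the softmax weighting, and the temperature value are assumptions for the sketch; the paper's exact scoring may differ.

```python
import numpy as np

def soft_match_descriptions(img_feat, desc_feats, temperature=0.07):
    """Softly weight LLM-generated descriptions by image similarity (sketch).

    img_feat: (D,) image embedding; desc_feats: (N, D) embeddings of the
    N candidate descriptions for one class.
    """
    # Cosine similarity between the image and each description embedding.
    img = img_feat / np.linalg.norm(img_feat)
    descs = desc_feats / np.linalg.norm(desc_feats, axis=1, keepdims=True)
    sims = descs @ img
    # Softmax turns raw similarities into soft filter weights,
    # down-weighting poorly matching descriptions instead of dropping them.
    w = np.exp(sims / temperature)
    w /= w.sum()
    # Class score = similarity averaged under the soft weights.
    return float((w * sims).sum()), w
```

Because filtering is soft, no description is discarded outright; mismatched ones simply contribute little to the final class score, which keeps the alignment robust when the LLM produces a few off-target descriptions.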