ITSELF: Attention Guided Fine-Grained Alignment for Vision-Language Retrieval

📅 2026-01-03
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses shortcut learning, spurious correlations, and intra-modal structural distortions caused by local alignment in text-guided person retrieval. To this end, the authors propose an unsupervised attention-guided fine-grained alignment framework that requires no additional supervision. The method leverages attention maps from the earliest training stages to construct a bank of high-saliency tokens and employs two core mechanisms, Multi-Layer Attention for Robust Selection (MARS) and the Adaptive Token Scheduler (ATS), to achieve reliable and non-redundant cross-modal alignment. Central to the framework are Guided Representation with Attentive Bank (GRAB), MARS, and ATS. Extensive experiments demonstrate state-of-the-art performance on three mainstream benchmarks and strong cross-dataset generalization.
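The coarse-to-fine retention budget that ATS is described as scheduling could, under simple assumptions, look like the linear ramp below. This is an illustrative sketch, not the paper's actual schedule: the start/end ratios and the linear decay are assumptions made here for clarity.

```python
def ats_budget(step, total_steps, start_ratio=0.9, end_ratio=0.3):
    """Illustrative coarse-to-fine token-retention schedule (not the paper's rule).

    Early in training a large fraction of tokens is kept, preserving context;
    the budget then shrinks linearly so later training focuses on the most
    discriminative tokens.
    """
    # Clamp training progress to [0, 1].
    t = min(max(step / total_steps, 0.0), 1.0)
    # Linearly interpolate the keep-ratio from start_ratio down to end_ratio.
    return start_ratio + (end_ratio - start_ratio) * t
```

With these placeholder ratios, `ats_budget(0, 100)` keeps 90% of tokens at the start and tapers to 30% by the end of training; the number of tokens retained would then be the ratio times the sequence length, rounded to an integer.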

๐Ÿ“ Abstract
Vision Language Models (VLMs) have rapidly advanced and show strong promise for text-based person search (TBPS), a task that requires capturing fine-grained relationships between images and text to distinguish individuals. Previous methods address these challenges through local alignment, yet they are often prone to shortcut learning and spurious correlations, yielding misalignment. Moreover, injecting prior knowledge can distort intra-modality structure. Motivated by our finding that encoder attention surfaces spatially precise evidence from the earliest training epochs, and to alleviate these issues, we introduce ITSELF, an attention-guided framework for implicit local alignment. At its core, Guided Representation with Attentive Bank (GRAB) converts the model's own attention into an Attentive Bank of high-saliency tokens and applies local objectives on this bank, learning fine-grained correspondences without extra supervision. To make the selection reliable and non-redundant, we introduce Multi-Layer Attention for Robust Selection (MARS), which aggregates attention across layers and performs diversity-aware top-k selection; and Adaptive Token Scheduler (ATS), which schedules the retention budget from coarse to fine over training, preserving context early while progressively focusing on discriminative details. Extensive experiments on three widely used TBPS benchmarks show state-of-the-art performance and strong cross-dataset generalization, confirming the effectiveness and robustness of our approach without additional prior supervision. Our project is publicly available at https://trhuuloc.github.io/itself
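The MARS step the abstract describes, aggregating attention across layers and then performing diversity-aware top-k selection, might be sketched as follows. This is a minimal illustration under assumed shapes (per-layer saliency scores and token features), not the authors' implementation; the greedy cosine-similarity filter is one plausible reading of "diversity-aware".

```python
import numpy as np

def mars_select(attn_layers, tokens, k, sim_thresh=0.9):
    """Sketch of diversity-aware top-k token selection from multi-layer attention.

    attn_layers: (L, T) array of per-layer saliency scores per token (assumed shape)
    tokens:      (T, D) array of token features
    Returns up to k token indices, most salient first, skipping tokens whose
    cosine similarity to an already-selected token exceeds sim_thresh.
    """
    # Aggregate attention across layers into one saliency score per token.
    saliency = attn_layers.mean(axis=0)
    # Unit-normalize token features so dot products are cosine similarities.
    feats = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    order = np.argsort(-saliency)  # most salient first
    picked = []
    for i in order:
        if len(picked) == k:
            break
        # Diversity check: skip tokens too similar to ones already kept.
        if picked and np.max(feats[picked] @ feats[i]) > sim_thresh:
            continue
        picked.append(i)
    return picked
```

Averaging over layers is only one aggregation choice; the redundancy filter ensures that near-duplicate tokens (e.g. adjacent patches of the same clothing region) do not crowd out other discriminative evidence.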
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Retrieval
Text-Based Person Search
Fine-Grained Alignment
Shortcut Learning
Spurious Correlations
Innovation

Methods, ideas, or system contributions that make the work stand out.

attention-guided alignment
fine-grained correspondence
vision-language retrieval
implicit local alignment
adaptive token selection
Tien-Huy Nguyen
University of Information Technology, Ho Chi Minh City, Vietnam
Huu-Loc Tran
University of Information Technology, Ho Chi Minh City, Vietnam
Thanh Duc Ngo
University of Information Technology, Vietnam National University Ho Chi Minh City, Vietnam
Computer Vision
Multimedia Content Analysis