From Global to Local: Rethinking CLIP Feature Aggregation for Person Re-Identification

📅 2026-04-23

🤖 AI Summary

This work addresses a limitation of existing CLIP-based person re-identification (ReID) methods: they aggregate features into a single global [CLS] token that lacks spatial selectivity, which degrades performance under occlusion and cross-camera variation. To overcome this, we propose SAGA-ReID, which places learnable anchor vectors in the CLIP text embedding space and uses them to align and reweight intermediate patch features from the visual encoder, emphasizing spatially stable regions and suppressing corrupted ones without requiring textual descriptions of individual images. By bypassing the global pooling bottleneck, our approach aims to separate identity-relevant evidence from occlusion-induced interference. Benchmark evaluations show consistent gains over CLIP-ReID on both standard and occluded ReID benchmarks, with improvements of up to 10.6 points in Rank-1 accuracy under occlusion.

📝 Abstract

CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global [CLS] token optimized for image-text alignment rather than spatial selectivity, making representations fragile under occlusion and cross-camera variation. We propose SAGA-ReID, which reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP's text embedding space. This emphasizes spatially stable evidence while suppressing corrupted or absent regions, without requiring textual descriptions of individual images. Controlled experiments isolate the aggregation mechanism under two qualitatively distinct conditions: synthetic masking, where identity signal is absent, and realistic human distractors, where an overlapping person introduces semantically confusing signal. In both conditions, SAGA's advantage over global pooling grows substantially as occlusion increases. Benchmark evaluations confirm consistent gains over CLIP-ReID across standard and occluded settings, with the largest improvements where global pooling is most unreliable: up to +10.6 Rank-1 on occluded benchmarks. SAGA's aggregation outperforms dedicated sequential patch aggregation on a stronger backbone, confirming that structured reconstruction addresses a bottleneck that backbone quality and architectural complexity alone cannot resolve. Code is available at https://github.com/ipl-uw/Structured-Anchor-Guided-Aggregation-for-ReID.
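
The contrast the abstract draws between global pooling and anchor-guided aggregation can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the cosine-similarity scoring, the softmax weighting, and the temperature value are all assumptions chosen for illustration, and the arrays stand in for patch tokens and anchor vectors that would in practice come from CLIP's visual encoder and text embedding space. The linked repository is the reference for the actual method.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def global_mean_pool(patches):
    # Baseline stand-in for global aggregation: every patch counts equally.
    return patches.mean(axis=0)


def anchor_guided_pool(patches, anchors, temperature=0.07):
    """Hypothetical anchor-guided aggregation.

    patches: (N, D) patch tokens; anchors: (K, D) anchor vectors.
    Returns a (K * D,) descriptor and the (K, N) per-anchor patch weights.
    """
    sim = l2_normalize(anchors) @ l2_normalize(patches).T  # (K, N) cosine scores
    weights = softmax(sim / temperature, axis=1)           # each anchor's weights sum to 1
    parts = weights @ patches                              # (K, D) weighted reconstructions
    return parts.reshape(-1), weights


# Toy usage with random stand-ins for patch tokens and anchors.
rng = np.random.default_rng(0)
toy_patches = rng.normal(size=(8, 4))
toy_anchors = rng.normal(size=(2, 4))
descriptor, patch_weights = anchor_guided_pool(toy_patches, toy_anchors)
print(descriptor.shape, patch_weights.shape)
```

In this sketch, a patch that matches no anchor receives little weight in any reconstruction, whereas mean pooling lets it contribute as much as every other patch. That is the intuition behind suppressing corrupted or absent regions, under the assumptions stated above.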

Problem

Research questions and friction points this paper is trying to address.

person re-identification

CLIP

feature aggregation

occlusion

cross-camera variation

Innovation

Methods, ideas, or system contributions that make the work stand out.

feature aggregation

CLIP

person re-identification