Pretrain-then-Adapt: Uncertainty-Aware Test-Time Adaptation for Text-based Person Search

📅 2026-04-07

🤖 AI Summary

This work addresses text-based person retrieval when labeled target-domain data is scarce, a setting in which the conventional pretrain-then-finetune paradigm is impractical. The authors propose UATTA, an uncertainty-aware offline test-time adaptation method that requires no target-domain annotations. UATTA estimates per-sample uncertainty from the disagreement between image-to-text and text-to-image retrieval, then uses that signal to recalibrate the model and mitigate domain shift. Applied to both CLIP-based and XVLM-based frameworks, the approach yields consistent improvements on four benchmarks (CUHK-PEDES, ICFG-PEDES, RSTPReid, and PAB), and ablations show it outperforming existing offline test-time adaptation strategies.

📝 Abstract

Text-based person search is constrained by data scarcity, a consequence of stringent privacy constraints and the high cost of manual annotation. To mitigate this, existing methods usually rely on a Pretrain-then-Finetune paradigm: models are first pretrained on synthetic person-caption data to establish cross-modal alignment and then fine-tuned on labeled real-world datasets. However, this paradigm is impractical in real-world deployment, where large-scale annotated target-domain data is typically inaccessible. In this work, we propose a new Pretrain-then-Adapt paradigm that removes the reliance on extensive target-domain supervision through offline test-time adaptation, adapting the model using only unlabeled test data at minimal post-training time cost. Previous entropy-based test-time adaptation tends to be overconfident on false positives. To mitigate this, we propose an Uncertainty-Aware Test-Time Adaptation (UATTA) framework, which introduces a bidirectional retrieval disagreement mechanism to estimate uncertainty: low uncertainty is assigned when an image-text pair ranks highly in both image-to-text and text-to-image retrieval, indicating high alignment; otherwise, high uncertainty is assigned. This indicator drives offline test-time model recalibration without labels, effectively mitigating domain shift. We validate UATTA on four benchmarks (CUHK-PEDES, ICFG-PEDES, RSTPReid, and PAB), showing consistent improvements across both CLIP-based (one-stage) and XVLM-based (two-stage) frameworks. Ablation studies confirm that UATTA outperforms existing offline test-time adaptation strategies, establishing a new benchmark for label-efficient, deployable person search systems. Our code is available at https://github.com/nkuzjh/UATTA.
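
To make the bidirectional retrieval disagreement idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a precomputed image-text similarity matrix, and it reduces uncertainty to a binary mutual top-k check. The function names (`ranks_descending`, `bidirectional_uncertainty`), the parameter `k`, and the 0/1 uncertainty values are illustrative assumptions; the abstract does not specify the exact scoring rule or how the signal enters the adaptation loss.

```python
import numpy as np


def ranks_descending(scores: np.ndarray, axis: int) -> np.ndarray:
    """Return the 0-based rank of each entry along `axis` (0 = highest score)."""
    order = np.argsort(-scores, axis=axis)
    # argsort of the sort order gives each entry's position in that order.
    return np.argsort(order, axis=axis)


def bidirectional_uncertainty(sim: np.ndarray, k: int = 1) -> np.ndarray:
    """Hypothetical binary uncertainty for every image-text pair.

    sim[i, j] is the similarity between image i and text j.
    A pair gets uncertainty 0.0 only if text j is in the top-k for image i
    (image-to-text) AND image i is in the top-k for text j (text-to-image).
    Every other pair gets uncertainty 1.0.
    """
    i2t_rank = ranks_descending(sim, axis=1)  # rank of each text, per image
    t2i_rank = ranks_descending(sim, axis=0)  # rank of each image, per text
    agree = (i2t_rank < k) & (t2i_rank < k)
    return np.where(agree, 0.0, 1.0)


if __name__ == "__main__":
    # Toy similarities: 3 images (rows) by 3 texts (columns).
    sim = np.array([
        [0.9, 0.2, 0.1],
        [0.3, 0.8, 0.7],
        [0.6, 0.4, 0.5],
    ])
    print(bidirectional_uncertainty(sim, k=1))
```

In this toy matrix, image 2 retrieves text 0 first, but text 0 retrieves image 0 first, so the pair (2, 0) is flagged as uncertain even though its one-directional score looks confident. This is the kind of false positive that a purely entropy-based criterion could reinforce.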

Problem

Research questions and friction points this paper is trying to address.

text-based person search

test-time adaptation

domain shift

data scarcity

uncertainty estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Adaptation

Uncertainty Estimation

Text-based Person Search