SCING: Towards More Efficient and Robust Person Re-Identification through Selective Cross-modal Prompt Tuning

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current CLIP-based person re-identification (ReID) methods suffer from suboptimal cross-modal alignment and high computational overhead due to insufficient modeling of visual–linguistic interactions. To address this, we propose a lightweight cross-modal prompt tuning framework. Our approach features: (1) a selective visual prompting fusion module that dynamically integrates textual semantics with image region features via a lightweight cross-gating mechanism; and (2) a perturbation-driven consistency alignment strategy, combining dual-path training with cross-modal embedding consistency regularization to enhance feature robustness. Unlike adapter-based methods, ours requires no complex architectural modifications. Extensive experiments on Market-1501 and DukeMTMC demonstrate substantial improvements over state-of-the-art CLIP-based ReID approaches—achieving Pareto-optimal trade-offs between accuracy and efficiency: +37% inference speedup and −52% FLOPs reduction, without compromising retrieval performance.
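The summary's "lightweight cross-gating mechanism" can be sketched in a purely illustrative way: a gate computed from the concatenated text and visual embeddings controls how much projected visual information is injected into the text prompt. All names, dimensions, and the residual form below are assumptions for illustration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_gated_fuse(text_emb, visual_emb, W_gate, W_proj):
    """Hypothetical selective visual-prompt fusion.

    A sigmoid gate, conditioned on both modalities, decides per-dimension
    how much of the projected visual feature to inject into the text prompt.
    """
    joint = np.concatenate([text_emb, visual_emb], axis=-1)  # (d_t + d_v,)
    gate = sigmoid(W_gate @ joint)                           # (d_t,), values in (0, 1)
    injected = W_proj @ visual_emb                           # project visual -> text space
    return text_emb + gate * injected                        # gated residual injection

d_t, d_v = 8, 16                                  # toy dimensions
text_emb = rng.standard_normal(d_t)
visual_emb = rng.standard_normal(d_v)
W_gate = 0.1 * rng.standard_normal((d_t, d_t + d_v))
W_proj = 0.1 * rng.standard_normal((d_t, d_v))

fused = cross_gated_fuse(text_emb, visual_emb, W_gate, W_proj)
print(fused.shape)  # (8,)
```

Because the gate is element-wise and the projection is a single linear map, this kind of fusion adds only a small parameter and FLOPs budget on top of the frozen encoders, which is consistent with the efficiency claims above.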

📝 Abstract
Recent advancements in adapting vision-language pre-training models like CLIP for person re-identification (ReID) tasks often rely on complex adapter designs or modality-specific tuning while neglecting cross-modal interaction, leading to high computational costs or suboptimal alignment. To address these limitations, we propose a simple yet effective framework named Selective Cross-modal Prompt Tuning (SCING) that enhances cross-modal alignment and robustness against real-world perturbations. Our method introduces two key innovations. First, we propose Selective Visual Prompt Fusion (SVIP), a lightweight module that dynamically injects discriminative visual features into text prompts via a cross-modal gating mechanism. Second, we propose Perturbation-Driven Consistency Alignment (PDCA), a dual-path training strategy that enforces invariant feature alignment under random image perturbations by regularizing consistency between original and augmented cross-modal embeddings. Extensive experiments on several popular benchmarks, including Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-REID, and P-DukeMTMC, demonstrate the impressive performance of the proposed method. Notably, our framework eliminates heavy adapters while maintaining efficient inference, achieving an optimal trade-off between performance and computational overhead. The code will be released upon acceptance.
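The dual-path consistency idea described for PDCA can be sketched as follows: the same encoder processes a clean image and a randomly perturbed copy, and a consistency term penalizes divergence between the two embeddings. The encoder stand-in, perturbation, and cosine-distance loss below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def consistency_loss(emb_clean, emb_perturbed):
    """Mean cosine distance between the two paths' embeddings."""
    z1, z2 = l2_normalize(emb_clean), l2_normalize(emb_perturbed)
    return float(np.mean(1.0 - np.sum(z1 * z2, axis=-1)))

def encode(images, W):
    # Stand-in for a frozen image encoder (e.g., CLIP's visual branch).
    return images @ W

images = rng.standard_normal((4, 32))             # toy batch of "images"
perturbed = images + 0.05 * rng.standard_normal(images.shape)
W = rng.standard_normal((32, 8))

emb_clean = encode(images, W)
emb_pert = encode(perturbed, W)                   # second, perturbed path
loss = consistency_loss(emb_clean, emb_pert)
print(loss)
```

In a real training loop, a term like this would be added to the retrieval objective so that embeddings stay aligned under occlusion-like or photometric perturbations, which is the robustness property the benchmarks (Occluded-Duke, Occluded-REID) test.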
Problem

Research questions and friction points this paper is trying to address.

Improves cross-modal alignment in person re-identification
Reduces computational costs without complex adapters
Enhances robustness against real-world image perturbations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Cross-modal Prompt Tuning (SCING) framework
Selective Visual Prompt Fusion (SVIP) module
Perturbation-Driven Consistency Alignment (PDCA) strategy