DMPT: Decoupled Modality-Aware Prompt Tuning for Multi-Modal Object Re-Identification

📅 2025-02-26
🏛️ IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitive computational and memory overhead of full fine-tuning large vision transformers (ViTs) in multimodal object re-identification (re-ID), this paper proposes DMPT, a lightweight decoupled prompt tuning framework. Methodologically, DMPT introduces three key innovations: (1) an explicit decoupling mechanism that separates modality-specific prompts from modality-agnostic semantic prompts; (2) Prompt Inverse Bind (PromptIBind), a cross-modal prompt binding strategy enabling semantic alignment and complementary interaction; and (3) text-encoder-guided modality prior modeling to support joint representation learning across visible, near-infrared, and thermal infrared modalities. Under frozen ViT backbones, DMPT fine-tunes only 6.5% of parameters, yet achieves state-of-the-art performance on multiple benchmarks. This demonstrates a significantly improved trade-off between efficiency and accuracy in multimodal re-ID.

📝 Abstract
Current multi-modal object re-identification approaches based on large-scale pre-trained backbones (i.e., ViT) have displayed remarkable progress and achieved excellent performance. However, these methods usually adopt the standard full fine-tuning paradigm, which requires the optimization of considerable backbone parameters, causing extensive computational and storage requirements. In this work, we propose an efficient prompt-tuning framework tailored for multi-modal object re-identification, dubbed DMPT, which freezes the main backbone and only optimizes several newly added decoupled modality-aware parameters. Specifically, we explicitly decouple the visual prompts into modality-specific prompts, which leverage prior modality knowledge from a powerful text encoder, and modality-independent semantic prompts, which extract semantic information from multi-modal inputs, such as visible, near-infrared, and thermal-infrared. Built upon the extracted features, we further design a Prompt Inverse Bind (PromptIBind) strategy that employs bind prompts as a medium to connect the semantic prompt tokens of different modalities and facilitates the exchange of complementary multi-modal information, boosting final re-identification results. Experimental results on multiple common benchmarks demonstrate that our DMPT can achieve results competitive with existing state-of-the-art methods while fine-tuning only 6.5% of the backbone parameters.
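The abstract describes a frozen backbone with two decoupled sets of learnable prompt tokens: modality-specific prompts (one set per modality) and shared, modality-independent semantic prompts. A minimal NumPy sketch of this decoupling is shown below; the embedding dimension, prompt counts, and the single frozen projection standing in for the ViT are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # embedding dimension (illustrative, not the paper's value)

# Stand-in for the frozen ViT backbone: a single projection that is
# never updated during prompt tuning.
backbone = {"proj": rng.normal(size=(DIM, DIM))}

# Learnable prompts, decoupled as the abstract describes:
#  - one modality-specific prompt set per input modality
#  - one shared, modality-independent semantic prompt set
modalities = ["visible", "near_infrared", "thermal_infrared"]
N_SPEC, N_SEM = 4, 4  # prompt counts are hypothetical
prompts = {
    "specific": {m: rng.normal(size=(N_SPEC, DIM)) for m in modalities},
    "semantic": rng.normal(size=(N_SEM, DIM)),
}

def encode(patch_tokens, modality):
    """Prepend the modality-specific and shared semantic prompts to the
    patch tokens, then pass everything through the frozen backbone."""
    tokens = np.concatenate(
        [prompts["specific"][modality], prompts["semantic"], patch_tokens]
    )
    return tokens @ backbone["proj"]  # frozen weights: not trained

# Only the prompt tokens would receive gradients during tuning; the
# trainable fraction here is a toy number, not the paper's 6.5%.
trainable = (
    sum(p.size for p in prompts["specific"].values()) + prompts["semantic"].size
)
total = trainable + sum(w.size for w in backbone.values())
print(f"trainable fraction: {trainable / total:.1%}")
```

In the real method the frozen component is the full pre-trained ViT, so the trainable fraction is far smaller; the sketch only shows where the parameter savings come from.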
Problem

Research questions and friction points this paper is trying to address.

Full fine-tuning of pre-trained ViT backbones for multi-modal re-ID optimizes considerable parameters, causing heavy computational and storage costs
How to decouple visual prompts into modality-specific and modality-independent semantic components
How to exchange complementary information across visible, near-infrared, and thermal-infrared modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled modality-aware prompt tuning framework
Modality-specific and semantic prompts design
Prompt Inverse Bind strategy for multi-modal fusion
👥 Authors
Minghui Lin — Huazhong University of Science and Technology
Shu Wang — Shandong University
Xiang Wang — Huazhong University of Science and Technology
Jianhua Tang — Shien-Ming Wu School of Intelligent Engineering, South China University of Technology
Longbin Fu — Huazhong University of Science and Technology
Zhengrong Zuo — Huazhong University of Science and Technology
Nong Sang — Huazhong University of Science and Technology