CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the susceptibility of vision-language models to misclassification among fine-grained, visually similar categories, which stems from inherent biases and limited discriminative capacity. To mitigate this, the authors propose CAPT, a novel framework that, for the first time, explicitly models stable class-confusion relationships. CAPT combines a Semantic Confusion Miner (SEM) and a Sample Confusion Miner (SAM) to uncover multi-granularity confusion cues; sample-level cues are integrated with global and local context through a Diff-Manner Adapter, and a Multi-Granularity Difference Expert (MGDE) module unifies semantic- and sample-level information for confusion-aware reasoning. Evaluated across 11 benchmark datasets, CAPT substantially reduces confusion-induced errors, improves discrimination and generalization on both base and novel classes, and corrects 50.72% of previously misclassified confusing sample pairs.

📝 Abstract
Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model's intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at https://github.com/greatest-gourmet/CAPT.
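The paper does not include implementation details here, but the core idea of the Confusion Bank — recording category pairs that the model persistently confuses, rather than treating errors as random — can be illustrated with a toy sketch. The function name `mine_confusion_pairs`, the row-normalization step, and the threshold value are hypothetical illustrations, not taken from the paper:

```python
import numpy as np

def mine_confusion_pairs(confusion: np.ndarray, threshold: float = 0.1):
    """Return (true_class, predicted_class, rate) triples whose
    off-diagonal error rate exceeds `threshold` -- a stand-in for the
    stable confusion relationships a Confusion Bank would record."""
    # Normalize rows so entry [i, j] is the rate at which true class i
    # is predicted as class j.
    rates = confusion / confusion.sum(axis=1, keepdims=True)
    n = confusion.shape[0]
    pairs = [
        (i, j, float(rates[i, j]))
        for i in range(n)
        for j in range(n)
        if i != j and rates[i, j] >= threshold
    ]
    # Most-confused pairs first.
    return sorted(pairs, key=lambda p: -p[2])

# Toy 3-class confusion matrix: class 0 is often mistaken for class 1.
cm = np.array([
    [70, 25,  5],
    [10, 85,  5],
    [ 2,  3, 95],
])
print(mine_confusion_pairs(cm))  # → [(0, 1, 0.25), (1, 0, 0.1)]
```

In this toy example, only the persistent pairs (0 → 1 and 1 → 0) survive the threshold; the paper's framework then mines semantic- and sample-level cues specifically for such pairs.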
Problem

Research questions and friction points this paper is trying to address.

vision-language misalignment
category confusion
fine-grained discrimination
systematic misclassification
cross-modal representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Confusion-Aware Prompt Tuning
Vision-Language Misalignment
Semantic Confusion Miner
Sample Confusion Miner
Multi-Granularity Difference Expert
Maoyuan Shao
School of Information Engineering, Minzu University of China
Yutong Gao
Nanjing University of Science and Technology
computer vision · NLP · AIGC
Xinyang Huang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications
Chuang Zhu
School of Artificial Intelligence, Beijing University of Posts and Telecommunications
Lijuan Sun
Johns Hopkins University
Cancer · Immunotherapy · molecular biology
Guoshun Nan
Professor, Beijing University of Posts and Telecommunications
Multimodal Learning · Video LLM · 6G Security · Semantic Communications