Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport

📅 2025-03-19
🤖 AI Summary
In open-vocabulary multi-label image recognition, vision-language models such as CLIP suffer from deficient local semantics and coarse region–label alignment, which leads to spurious predictions. To address this, the authors propose RAM (Recover And Match), a two-stage decoupled framework. First, a Ladder Local Adapter (LLA) recovers fine-grained regional semantics. Second, a Knowledge-Constrained Optimal Transport (KCOT) mechanism explicitly models structured region–label correspondences and suppresses erroneous associations. The method integrates regional feature decoupling, reweighting, and CLIP fine-tuning. Evaluated on datasets from three distinct domains, it consistently outperforms prior approaches and reports state-of-the-art multi-label classification accuracy, including on unseen categories.

📝 Abstract
Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted by its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels is neglected; existing approaches rely instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that addresses both issues. To tackle the first problem, we propose the Ladder Local Adapter (LLA), which enforces refocusing on local regions and recovers local semantics in a memory-friendly way. For the second, we propose Knowledge-Constrained Optimal Transport (KCOT), which formulates the task as an optimal transport problem to suppress meaningless matching to non-ground-truth (non-GT) labels. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains and shows great potential to boost existing methods. Code: https://github.com/EricTan7/RAM.
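
For orientation, matching regions to labels as an optimal transport problem can be written in the standard entropic form below. This is only the generic textbook formulation, not the paper's KCOT objective: the abstract does not spell out the knowledge constraints, and every symbol here (M, N, C, a, b, ε) is an assumption introduced for illustration.

```latex
% Assumed notation (not from the source): M image regions, N candidate labels,
% C_{ij} = 1 - \cos(r_i, t_j) is the cost between region feature r_i and label embedding t_j,
% a and b are the region and label marginals, \varepsilon > 0 weights the entropy term.
\[
T^{\star} = \arg\min_{T \in U(a,b)} \; \sum_{i=1}^{M}\sum_{j=1}^{N} T_{ij}\, C_{ij} \;-\; \varepsilon H(T),
\qquad
U(a,b) = \bigl\{\, T \in \mathbb{R}_{\ge 0}^{M \times N} : T\mathbf{1}_N = a,\; T^{\top}\mathbf{1}_M = b \,\bigr\}
\]
\[
H(T) = -\sum_{i,j} T_{ij}\,\bigl(\log T_{ij} - 1\bigr)
\]
```

Under this reading, the plan entry T_ij says how much of region i is assigned to label j, so regions compete for labels instead of being averaged together.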

Problem

Research questions and friction points this paper is trying to address.

Recover local semantics disrupted by CLIP's global pre-training.
Improve matching between image regions and candidate labels.
Suppress spurious predictions from irrelevant image regions.
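
The effect of naive aggregation can be seen in a toy example. The sketch below is not taken from the paper: the embeddings, the label names, and the per-region maximum used for comparison are all hypothetical, chosen only to show how average pooling can let background regions outvote the one region that actually contains an object.

```python
import numpy as np

def cosine_scores(regions, labels):
    """Cosine similarity between every region (M, D) and every label (N, D)."""
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    t = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    return r @ t.T  # shape (M, N)

# Hypothetical 3-dim embeddings: 2 candidate labels, 4 image regions.
labels = np.array([[1.0, 0.0, 0.0],    # label 0, present in the image
                   [0.0, 1.0, 0.0]])   # label 1, absent from the image
regions = np.array([[1.0, 0.0, 0.0],   # the one region showing label 0
                    [0.0, 0.6, 0.8],   # background, weakly resembles label 1
                    [0.0, 0.6, 0.8],   # background
                    [0.0, 0.6, 0.8]])  # background

# Average pooling: collapse all regions into one feature, then score labels.
pooled = regions.mean(axis=0, keepdims=True)
pooled_scores = cosine_scores(pooled, labels)[0]   # shape (N,)

# Region-level view: keep one score per (region, label) pair.
region_scores = cosine_scores(regions, labels)     # shape (M, N)
best_region_scores = region_scores.max(axis=0)     # shape (N,)

print("pooled:", pooled_scores)
print("best region per label:", best_region_scores)
```

In this toy case the pooled feature scores the absent label above the present one, because three weakly related background regions dominate the average, while the region-level scores still rank the present label first.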

Innovation

Methods, ideas, or system contributions that make the work stand out.

Ladder Local Adapter refocuses on local semantics.
Knowledge-Constrained Optimal Transport suppresses irrelevant matches.
RAM framework achieves state-of-the-art multi-label recognition.
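
To make the optimal transport idea concrete, here is a minimal sketch of plain entropic optimal transport solved with Sinkhorn iterations on a region–label similarity matrix. It is not the paper's KCOT: the knowledge constraints are not described in the abstract and are not reproduced here. The similarity values, the uniform marginals, and the `eps` and `n_iters` settings are assumptions for illustration.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Entropic OT plan between row marginal a (M,) and column marginal b (N,)."""
    K = np.exp(-cost / eps)        # Gibbs kernel derived from the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)            # rescale rows toward marginal a
        v = b / (K.T @ u)          # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

# Hypothetical cosine similarities: 4 regions (rows) x 2 labels (columns).
sim = np.array([[1.0, 0.0],
                [0.0, 0.6],
                [0.0, 0.6],
                [0.0, 0.6]])
cost = 1.0 - sim                   # low cost where region and label agree

# Assumed uniform marginals: every region and every label carries equal mass.
a = np.full(sim.shape[0], 1.0 / sim.shape[0])
b = np.full(sim.shape[1], 1.0 / sim.shape[1])

plan = sinkhorn(cost, a, b)              # shape (4, 2), entries sum to 1
label_scores = (plan * sim).sum(axis=0)  # transport-weighted similarity per label

print(plan.round(3))
print(label_scores.round(3))
```

The plan replaces uniform averaging with a soft assignment in which each region's mass is split across labels according to cost, subject to the marginal constraints. How the marginals are chosen strongly shapes the result, which is presumably where additional constraints such as those in KCOT would enter.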

Authors

Hao Tan
Adobe Research
Vision and Language, 3D Multimodal

Zichang Tan
Previously CASIA, Baidu Inc.
Computer Vision, Biometrics, Autonomous Driving, Robotics, MLLM

Jun Li
MAIS, Institute of Automation, Chinese Academy of Sciences

Ajian Liu
MAIS, Institute of Automation, Chinese Academy of Sciences

Jun Wan
MAIS, Institute of Automation, Chinese Academy of Sciences; SAI, UCAS

Zhen Lei
Associate Professor, OSCO Research Chair in Off-site Construction
Offsite Construction, Construction Engineering and Management