Semantic Relation-Enhanced CLIP Adapter for Domain Adaptive Zero-Shot Learning

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In domain-adaptive zero-shot learning (DAZSL), existing methods struggle to jointly achieve cross-domain transfer and cross-category generalization, especially under high annotation costs and scarce target-domain data. In particular, they fail to harness the semantic generalization capacity of vision-language models such as CLIP, leading to inefficient knowledge transfer and degraded cross-modal alignment during fine-tuning. To address this, we introduce CLIP into the DAZSL framework for the first time, proposing a Semantic Relation Structure Loss and a Cross-Modal Alignment Retention Strategy that jointly model the semantic topology among categories while enforcing consistency between the visual and textual embedding spaces. Our approach achieves significant improvements over state-of-the-art methods on the I2AwA and I2WebV benchmarks, demonstrating superior effectiveness and robustness in jointly generalizing to unseen classes and the target domain.

📝 Abstract
The high cost of data annotation has spurred research on training deep learning models in data-limited scenarios. Existing paradigms, however, fail to balance cross-domain transfer and cross-category generalization, giving rise to the demand for Domain-Adaptive Zero-Shot Learning (DAZSL). Although vision-language models (e.g., CLIP) have inherent advantages in the DAZSL field, current studies do not fully exploit their potential. Applying CLIP to DAZSL faces two core challenges: inefficient cross-category knowledge transfer due to the lack of semantic relation guidance, and degraded cross-modal alignment during target domain fine-tuning. To address these issues, we propose a Semantic Relation-Enhanced CLIP (SRE-CLIP) Adapter framework, integrating a Semantic Relation Structure Loss and a Cross-Modal Alignment Retention Strategy. As the first CLIP-based DAZSL method, SRE-CLIP achieves state-of-the-art performance on the I2AwA and I2WebV benchmarks, significantly outperforming existing approaches.
Problem

Research questions and friction points this paper is trying to address.

Addresses inefficient cross-category knowledge transfer in domain adaptation
Solves degraded cross-modal alignment during target domain fine-tuning
Enhances CLIP's performance for domain-adaptive zero-shot learning scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Relation Structure Loss enhances knowledge transfer
Cross-Modal Alignment Retention Strategy maintains feature consistency
Adapter framework integrates semantic relations with CLIP model
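The page gives no implementation details, but the two losses named above can be sketched roughly as follows. This is an illustrative toy sketch, not the authors' implementation: plain Python lists stand in for CLIP text/visual embeddings, and all function names are hypothetical. The relation loss matches the pairwise cosine-similarity ("semantic topology") matrices of the two spaces; the retention loss penalizes drift of adapted features from the frozen CLIP embedding during fine-tuning.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def relation_matrix(embs):
    # Pairwise cosine similarities: the semantic topology among categories.
    n = len(embs)
    return [[cosine(embs[i], embs[j]) for j in range(n)] for i in range(n)]

def relation_structure_loss(text_embs, visual_embs):
    # Mean squared difference between the two relation matrices:
    # small when visual prototypes mirror the text-space class topology.
    T = relation_matrix(text_embs)
    V = relation_matrix(visual_embs)
    n = len(T)
    return sum((T[i][j] - V[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

def alignment_retention_loss(adapted, frozen):
    # Mean squared drift of an adapted feature from its frozen CLIP feature,
    # preserving the original image-text alignment during target-domain tuning.
    return sum((a - f) ** 2 for a, f in zip(adapted, frozen)) / len(adapted)
```

In this reading, the adapter would be trained with a weighted sum of the task loss and these two regularizers; the exact weighting and where the relation graph comes from (e.g., class-name text embeddings) are not specified on this page.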
Jiaao Yu
School of Computer Science and Technology, East China Normal University, China
Mingjie Han
School of Computer Science and Technology, East China Normal University, China
Jinkun Jiang
College of Computer Science and Technology, Ocean University of China, China
Junyu Dong
Ocean University of China, China
Tao Gong
School of Computer Science and Technology, East China Normal University, China
Man Lan
School of Computer Science and Technology, East China Normal University, China