CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing CLIP knowledge distillation methods: they fail to explicitly model the multidirectional relational dependencies between teacher and student embeddings, and therefore struggle to preserve structured semantic information. To overcome this, the authors propose a relational knowledge distillation framework that jointly captures cross-modal and cross-layer relational structures through Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength at the distributional level, while XRD introduces bidirectional symmetric constraints to align cross-modal similarity distributions. This end-to-end framework improves the fidelity of lightweight student models to the geometric structure of teacher embeddings, achieving a 0.8-percentage-point improvement over current state-of-the-art methods on zero-shot transfer tasks.

📝 Abstract
CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture, however, requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. Existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods: Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student similarity distributions. By jointly modeling multi-directional relational structures, CLIP-RD promotes faithful alignment of the student embedding geometry with that of the teacher, outperforming existing methods by 0.8 percentage points.
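The paper does not publish its loss equations here, but the XRD idea described above — aligning cross-modal teacher-student similarity distributions with a bidirectional symmetric constraint — can be illustrated with a minimal sketch. All function and variable names below are hypothetical, and the symmetric-KL form is an assumption for illustration, not the authors' exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over similarity rows
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-8):
    # Row-wise KL divergence KL(p || q)
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def cross_relation_loss(t_img, t_txt, s_img, s_txt, tau=0.07):
    """Hypothetical XRD-style loss: match teacher and student
    cross-modal similarity distributions in both directions
    (image->text and text->image) with a symmetrized KL term."""
    norm = lambda z: z / np.linalg.norm(z, axis=-1, keepdims=True)
    t_img, t_txt, s_img, s_txt = map(norm, (t_img, t_txt, s_img, s_txt))
    # Temperature-scaled cross-modal similarity distributions
    p_i2t = softmax(t_img @ t_txt.T / tau)  # teacher, image->text
    q_i2t = softmax(s_img @ s_txt.T / tau)  # student, image->text
    p_t2i = softmax(t_txt @ t_img.T / tau)  # teacher, text->image
    q_t2i = softmax(s_txt @ s_img.T / tau)  # student, text->image
    # Bidirectional symmetric constraint: KL in both directions, both modalities
    return 0.25 * (kl(p_i2t, q_i2t) + kl(q_i2t, p_i2t)
                   + kl(p_t2i, q_t2i) + kl(q_t2i, p_t2i)).mean()
```

The loss is zero when the student reproduces the teacher's cross-modal similarity structure exactly, and symmetrizing the KL avoids privileging either the teacher or the student distribution as the reference.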
Problem

Research questions and friction points this paper is trying to address.

CLIP
knowledge distillation
relational dependencies
embedding geometry
zero-shot generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Relational Knowledge Distillation
Vertical Relational Distillation
Cross Relational Distillation
CLIP
Zero-shot Generalization
Jeannie Chung
Ewha Womans University, Seoul 03760, South Korea
Hanna Jang
Ewha Womans University, Seoul 03760, South Korea
Ingyeong Yang
Ewha Womans University, Seoul 03760, South Korea
Uiwon Hwang
Assistant Professor, Computer Science and Engineering, Ewha Womans University
Generative AI · Data-Centric AI · Artificial General Intelligence · Machine Learning
Jaehyung Sim
Ewha Womans University, Seoul 03760, South Korea