Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
CLIP suffers from fixed image resolution and a lack of fine-grained spatial grounding, limiting its capability for pixel-level cross-modal alignment in image–text retrieval. To address this, we propose a meta teacher–student distillation framework that integrates YOLO-based region detection and textual span modeling into a cross-modal Transformer, enabling joint semantic–spatial alignment via bidirectional cross-attention. We further introduce a hybrid loss combining contrastive learning and cosine similarity to distill robust global representations. Trained on only 67.5K samples, our method achieves significant improvements in Recall@K and mAP while retaining 94% of CLIP's zero-shot classification accuracy. Crucially, it enhances fine-grained retrieval without compromising CLIP's generalization ability, effectively balancing domain specialization and broad applicability.

📝 Abstract
We present Distill CLIP (DCLIP), a fine-tuned variant of the CLIP model that enhances multimodal image-text retrieval while preserving the original model's strong zero-shot classification capabilities. CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. These semantically and spatially aligned global representations guide the training of a lightweight student model using a hybrid loss that combines contrastive learning and cosine similarity objectives. Despite being trained on only ~67,500 samples curated from MSCOCO, Flickr30k, and Conceptual Captions (just a fraction of CLIP's original dataset), DCLIP significantly improves image-text retrieval metrics (Recall@K, mAP) while retaining approximately 94% of CLIP's zero-shot classification performance. These results demonstrate that DCLIP effectively mitigates the trade-off between task specialization and generalization, offering a resource-efficient, domain-adaptive, and detail-sensitive solution for advanced vision-language tasks. Code available at https://anonymous.4open.science/r/DCLIP-B772/README.md.
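The abstract's bidirectional cross-attention between YOLO-extracted image regions and textual spans could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the embedding dimension, head count, residual/norm layout, and mean-pooling of the global representations are all assumptions.

```python
# Illustrative sketch of bidirectional cross-attention between region and
# span features; dimensions and pooling choices are assumed, not from the paper.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_i = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, regions, spans):
        # regions: (B, R, dim) YOLO-extracted region embeddings
        # spans:   (B, S, dim) textual span embeddings
        img_ctx, _ = self.img2txt(regions, spans, spans)    # regions attend to text
        txt_ctx, _ = self.txt2img(spans, regions, regions)  # text attends to regions
        # Residual + norm, then mean-pool to a single global embedding per modality.
        img_global = self.norm_i(regions + img_ctx).mean(dim=1)  # (B, dim)
        txt_global = self.norm_t(spans + txt_ctx).mean(dim=1)    # (B, dim)
        return img_global, txt_global

xattn = BidirectionalCrossAttention()
img, txt = xattn(torch.randn(2, 5, 512), torch.randn(2, 7, 512))
print(img.shape, txt.shape)  # torch.Size([2, 512]) torch.Size([2, 512])
```

The pooled `img_global`/`txt_global` would play the role of the teacher's "enriched embeddings" that supervise the lightweight student.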
Problem

Research questions and friction points this paper is trying to address.

Enhance image-text retrieval with fine-tuned CLIP model
Overcome fixed resolution and limited context in CLIP
Balance task specialization and zero-shot generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta teacher-student distillation framework
Bidirectional cross-attention for enriched embeddings
Hybrid loss combining contrastive and cosine objectives
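The hybrid loss named above could look like the sketch below: a CLIP-style symmetric InfoNCE contrastive term on the student's image-text pairs, plus a cosine-similarity distillation term pulling student embeddings toward the teacher's. The weighting `alpha` and the temperature are assumed hyperparameters, not values reported in the paper.

```python
# Hypothetical sketch of a hybrid contrastive + cosine-distillation loss;
# alpha and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def hybrid_loss(s_img, s_txt, t_img, t_txt, alpha=0.5, temperature=0.07):
    # Normalize all embeddings so dot products are cosine similarities.
    s_img, s_txt = F.normalize(s_img, dim=-1), F.normalize(s_txt, dim=-1)
    t_img, t_txt = F.normalize(t_img, dim=-1), F.normalize(t_txt, dim=-1)

    # Symmetric InfoNCE over the batch (CLIP-style contrastive term):
    # matched image-text pairs lie on the diagonal of the logit matrix.
    logits = s_img @ s_txt.t() / temperature
    labels = torch.arange(len(s_img))
    contrastive = 0.5 * (F.cross_entropy(logits, labels)
                         + F.cross_entropy(logits.t(), labels))

    # Cosine-distillation term: push student embeddings toward the teacher's.
    distill = ((1 - (s_img * t_img).sum(-1)).mean()
               + (1 - (s_txt * t_txt).sum(-1)).mean())

    return alpha * contrastive + (1 - alpha) * distill

loss = hybrid_loss(torch.randn(4, 512), torch.randn(4, 512),
                   torch.randn(4, 512), torch.randn(4, 512))
```

With this formulation, the contrastive term preserves retrieval discrimination while the distillation term transfers the teacher's aligned global representations, matching the specialization/generalization balance the summary describes.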