D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of data-free quantization for Contrastive Language-Image Pre-training (CLIP) models. We propose D4C, the first data-free quantization framework designed specifically for CLIP. D4C jointly optimizes pseudo-data generation and quantization via three synergistic components: Prompt-Guided Semantic Injection, Structural Contrastive Generation, and Perturbation-Aware Enhancement, which together enable the synthesis of semantically faithful and structurally diverse pseudo-images. Unlike existing data-free quantization (DFQ) methods, which suffer severe performance degradation on CLIP, D4C substantially improves zero-shot transfer accuracy: under W4A8 quantization, it boosts ImageNet-1K top-1 accuracy by 1.4% (ResNet-50) and 5.7% (ViT-B/32), and achieves gains of up to 19.7% on CIFAR benchmarks. To our knowledge, this is the first demonstration of the feasibility and effectiveness of data-free quantization for large-scale multimodal foundation models.
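The W4A8 setting above denotes 4-bit weights and 8-bit activations. As a point of reference for that notation, here is a minimal sketch of symmetric uniform fake-quantization at those bit-widths; it is a generic illustration, not D4C's actual quantizer.

```python
import torch

def quantize_symmetric(x: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Fake-quantize x to n_bits using a per-tensor symmetric scale."""
    qmax = 2 ** (n_bits - 1) - 1                   # 7 for 4-bit, 127 for 8-bit
    scale = x.abs().max().clamp(min=1e-8) / qmax   # avoid division by zero
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale                               # dequantize back to float

weight = torch.randn(256, 256)
activation = torch.relu(torch.randn(32, 256))

w_q = quantize_symmetric(weight, n_bits=4)       # W4: 4-bit weights
a_q = quantize_symmetric(activation, n_bits=8)   # A8: 8-bit activations
print(f"weight MAE:     {(weight - w_q).abs().mean():.4f}")
print(f"activation MAE: {(activation - a_q).abs().mean():.4f}")
```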

📝 Abstract
Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo-images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces the compositional structure of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. Together, these components enable D4C to synthesize images that are both semantically informative and structurally diverse, effectively closing the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements across various bit-widths and models. For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvements of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
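To make the Prompt-Guided Semantic Injection idea concrete, below is a minimal PyTorch sketch that optimizes a learnable pseudo-image so its CLIP embedding aligns with a class-prompt text embedding. It assumes the open-source `clip` package; the prompt, learning rate, and step count are illustrative, and the paper's actual synthesis objective may differ.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model.float()  # keep everything in fp32 so gradients flow cleanly
for p in model.parameters():
    p.requires_grad_(False)  # only the pseudo-image is optimized

# Text embedding for an (assumed) class prompt.
tokens = clip.tokenize(["a photo of a golden retriever"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(tokens)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Learnable pseudo-image, initialized from noise (224x224 for ViT-B/32).
pseudo = torch.randn(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([pseudo], lr=0.05)

for step in range(200):
    img_feat = model.encode_image(pseudo)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = 1.0 - (img_feat * text_feat).sum()  # 1 - cosine similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```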
Problem

Research questions and friction points this paper is trying to address.

Extending data-free quantization to vision-language CLIP models
Addressing semantic insufficiency in synthesized CLIP training samples
Improving intra-image diversity for contrastive language-image models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates semantically rich pseudo images using text prompts
Reproduces natural image structures with contrastive foreground-background synthesis
Enhances sample diversity and robustness through controlled perturbations (see the sketch after this list)
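A minimal sketch of what controlled perturbation could look like in practice, assuming standard torchvision transforms; the perturbation types and magnitudes here are illustrative, not the paper's exact Perturbation-Aware Enhancement design.

```python
import torch
import torchvision.transforms as T

# Controlled perturbations: mild spatial and photometric changes.
perturb = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2),
])

def enhance(pseudo_images: torch.Tensor, noise_std: float = 0.01) -> torch.Tensor:
    """Perturb a batch of pseudo-images and add low-magnitude Gaussian noise."""
    out = perturb(pseudo_images)
    return out + noise_std * torch.randn_like(out)

batch = torch.rand(8, 3, 224, 224)   # synthesized pseudo-images in [0, 1]
calib_batch = enhance(batch)         # diversified samples for calibration
```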
Wenlun Zhang
Keio University
Deep Learning · Integrated Circuits · Electronics
Yunshan Zhong
Hainan University
Zihao Ding
Department of Electronics and Electrical Engineering, Keio University
Xinyu Li
Department of Electronics and Electrical Engineering, Keio University
Kentaro Yoshioka
Keio University
Efficient hardware · Intelligent sensing systems · Sensor