TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge

📅 2025-10-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of deploying vision-language models such as CLIP on resource-constrained devices, this work proposes a ternary quantization framework tailored for extreme model compression. The method jointly applies ternary weight representation (99% of weights ternarized, averaging 1.58 bits per parameter), quantization-aware training, and multi-stage knowledge distillation to co-compress both the visual and textual encoders. Evaluated across 41 zero-shot image classification and vision–language retrieval benchmarks, it maintains competitive performance while achieving a 16.98× compression ratio, 2.3× inference speedup, 16× storage reduction, 10× memory footprint reduction, and 60% structural sparsity. Its core contribution is the first empirical validation of high-fidelity deployment of CLIP-like multimodal foundation models under sub-2-bit quantization, establishing a paradigm for efficient multimodal understanding at the edge.
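The summary names ternary weight representation but not the exact ternarization rule. Below is a minimal sketch of threshold-based ternarization in the style of Ternary Weight Networks; the `ternarize` helper, the 0.75 threshold factor, and the per-tensor scale are illustrative assumptions, not the paper's confirmed method.

```python
import torch

def ternarize(w: torch.Tensor, delta_scale: float = 0.75) -> torch.Tensor:
    """Map a float weight tensor to {-alpha, 0, +alpha} (illustrative sketch).

    Assumes the common TWN-style heuristic threshold
    delta = delta_scale * mean|w|; the paper's actual rule may differ.
    """
    delta = delta_scale * w.abs().mean()               # weights with |w| <= delta become 0
    codes = torch.sign(w) * (w.abs() > delta).float()  # ternary codes in {-1, 0, +1}
    n_kept = codes.abs().sum().clamp(min=1.0)          # avoid division by zero
    # Per-tensor scale alpha minimizing ||w - alpha * codes||^2 over retained weights
    alpha = (w.abs() * codes.abs()).sum() / n_kept
    return alpha * codes
```

The zeroed weights are also where the reported sparsity comes from: every weight falling below the threshold contributes a structural zero.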

📝 Abstract
Recent years have witnessed increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose TernaryCLIP, a lightweight computational framework that converts the connection weights of both the vision and text encoders of CLIP into a ternary format, instead of full-precision floating-point values. TernaryCLIP incorporates quantization-aware training and distillation modules, preventing precision degradation and enabling low-cost, high-efficiency computation. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99% ternarized weights with a 1.58-bit representation, a 16.98$\times$ compression ratio, 2.3$\times$ inference acceleration, 16$\times$ storage reduction, 10$\times$ memory optimization, and 60% sparsity while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.
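The 1.58-bit figure quoted in both the summary and the abstract is the information content of a three-valued weight: a parameter restricted to $\{-1, 0, +1\}$ carries at most $\log_2 3 \approx 1.585$ bits, which rounds to the reported 1.58 bits per parameter.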
Problem

Research questions and friction points this paper is trying to address.

Compressing CLIP model weights to ternary format
Maintaining performance while reducing computational resource requirements
Enabling efficient deployment on resource-constrained devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ternary weights compress both the vision and text encoders
Quantization-aware training prevents precision degradation
Multi-stage knowledge distillation preserves accuracy under extreme quantization (see the sketch below)
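Neither bullet pins down the training mechanics. A minimal sketch follows, assuming a straight-through estimator for quantization-aware training and a temperature-scaled KL distillation loss; both are common choices, and the paper's multi-stage distillation schedule is not reproduced here.

```python
import torch
import torch.nn.functional as F

class TernarySTE(torch.autograd.Function):
    """Quantization-aware training via a straight-through estimator:
    ternarize on the forward pass, pass gradients through on the backward pass."""

    @staticmethod
    def forward(ctx, w: torch.Tensor) -> torch.Tensor:
        delta = 0.75 * w.abs().mean()  # assumed threshold heuristic
        codes = torch.sign(w) * (w.abs() > delta).float()
        alpha = (w.abs() * codes.abs()).sum() / codes.abs().sum().clamp(min=1.0)
        return alpha * codes

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor) -> torch.Tensor:
        return grad_out  # identity gradient: quantization treated as pass-through

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Temperature-scaled KL divergence between teacher and student distributions
    (standard Hinton-style distillation; the paper's multi-stage variant may differ)."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

In training, the quantized weights would be used as `w_q = TernarySTE.apply(w)` inside each linear layer, while `distillation_loss` is added to the contrastive objective so that a full-precision teacher guides the ternary student.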
👥 Authors
Shu-Hao Zhang
State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210063, China
Wei-Cheng Tang
State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210063, China
Chen Wu
Microsoft AI, Beijing 100080, China
Peng Hu
Microsoft AI, Beijing 100080, China
Nan Li
Microsoft AI, Beijing 100080, China
Liang-Jie Zhang
Distinguished Professor, Shenzhen University (SZU); ACM Distinguished Scientist and IEEE Fellow
Qi Zhang
Microsoft AI, Beijing 100080, China
Shao-Qun Zhang
State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210063, China