TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge

📅 2025-10-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of deploying vision-language models such as CLIP on resource-constrained devices, this work proposes a ternary quantization framework tailored for extreme model compression. The method jointly applies ternary weight representation (99% of weights ternarized, averaging 1.58 bits per parameter), quantization-aware training, and multi-stage knowledge distillation to co-compress both the visual and textual encoders. Evaluated across 41 zero-shot image classification and vision–language retrieval benchmarks, it maintains competitive performance while achieving a 16.98× compression ratio, 2.3× inference speedup, 16× storage reduction, 10× memory footprint reduction, and 60% structural sparsity. Its core contribution is the first empirical validation of high-fidelity deployment of CLIP-like multimodal foundation models under sub-2-bit quantization, establishing a paradigm for efficient multimodal understanding at the edge.
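The summary names ternary weight representation but not the exact ternarization rule. Below is a minimal sketch of threshold-based ternarization in the style of Ternary Weight Networks; the `ternarize` helper, the 0.75 threshold factor, and the per-tensor scale are illustrative assumptions, not the paper's confirmed method.

```python
import torch

def ternarize(w: torch.Tensor, delta_scale: float = 0.75) -> torch.Tensor:
    """Map a float weight tensor to {-alpha, 0, +alpha} (illustrative sketch).

    Assumes the common TWN-style heuristic threshold
    delta = delta_scale * mean|w|; the paper's actual rule may differ.
    """
    delta = delta_scale * w.abs().mean()               # weights with |w| <= delta become 0
    codes = torch.sign(w) * (w.abs() > delta).float()  # ternary codes in {-1, 0, +1}
    n_kept = codes.abs().sum().clamp(min=1.0)          # avoid division by zero
    # Per-tensor scale alpha minimizing ||w - alpha * codes||^2 over retained weights
    alpha = (w.abs() * codes.abs()).sum() / n_kept
    return alpha * codes
```

The zeroed weights are also where the reported sparsity comes from: every weight falling below the threshold contributes a structural zero.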

📝 Abstract
Recent years have witnessed increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose TernaryCLIP, a lightweight computational framework that converts the connection weights of both the vision and text encoders of CLIP into a ternary format, instead of full-precision floating-point values. TernaryCLIP incorporates quantization-aware training and distillation modules, preventing precision degradation and enabling low-cost, high-efficiency computation. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99% ternarized weights with a 1.58-bit representation, a 16.98$\times$ compression ratio, 2.3$\times$ inference acceleration, 16$\times$ storage reduction, 10$\times$ memory optimization, and 60% sparsity while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.
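The 1.58-bit figure quoted in both the summary and the abstract is the information content of a three-valued weight: a parameter restricted to $\{-1, 0, +1\}$ carries at most $\log_2 3 \approx 1.585$ bits, which rounds to the reported 1.58 bits per parameter.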
Problem

Research questions and friction points this paper is trying to address.

Compressing CLIP model weights to ternary format
Maintaining performance while reducing computational resource requirements
Enabling efficient deployment on resource-constrained devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ternary weights compress both the vision and text encoders
Quantization-aware training prevents precision degradation
Multi-stage knowledge distillation preserves accuracy under extreme quantization (see the sketch below)
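Neither bullet pins down the training mechanics. A minimal sketch follows, assuming a straight-through estimator for quantization-aware training and a temperature-scaled KL distillation loss; both are common choices, and the paper's multi-stage distillation schedule is not reproduced here.

```python
import torch
import torch.nn.functional as F

class TernarySTE(torch.autograd.Function):
    """Quantization-aware training via a straight-through estimator:
    ternarize on the forward pass, pass gradients through on the backward pass."""

    @staticmethod
    def forward(ctx, w: torch.Tensor) -> torch.Tensor:
        delta = 0.75 * w.abs().mean()  # assumed threshold heuristic
        codes = torch.sign(w) * (w.abs() > delta).float()
        alpha = (w.abs() * codes.abs()).sum() / codes.abs().sum().clamp(min=1.0)
        return alpha * codes

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor) -> torch.Tensor:
        return grad_out  # identity gradient: quantization treated as pass-through

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Temperature-scaled KL divergence between teacher and student distributions
    (standard Hinton-style distillation; the paper's multi-stage variant may differ)."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

In training, the quantized weights would be used as `w_q = TernarySTE.apply(w)` inside each linear layer, while `distillation_loss` is added to the contrastive objective so that a full-precision teacher guides the ternary student.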
👥 Authors
Shu-Hao Zhang
State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210063, China
Wei-Cheng Tang
State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210063, China
Chen Wu
Microsoft AI, Beijing 100080, China
Peng Hu
Microsoft AI, Beijing 100080, China
Nan Li
Microsoft AI, Beijing 100080, China
Liang-Jie Zhang
Distinguished Professor, Shenzhen University (SZU); ACM Distinguished Scientist and IEEE Fellow
Qi Zhang
Microsoft AI, Beijing 100080, China
Shao-Qun Zhang
State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210063, China