🤖 AI Summary
To address the deployment challenges of vision-language models (e.g., CLIP) on resource-constrained devices, this work proposes a ternary quantization framework tailored for extreme model compression. The method jointly applies ternary weight representation (99% ternarized weights, averaging 1.58 bits/parameter), quantization-aware training, and multi-stage knowledge distillation to co-compress both visual and textual encoders. Evaluated across 41 zero-shot image classification and vision–language retrieval benchmarks, it maintains state-of-the-art performance while achieving 16.98× model compression, 2.3× inference speedup, 16× storage reduction, 10× memory footprint decrease, and 60% structural sparsity. Its core contribution lies in the first empirical validation of high-fidelity deployment of CLIP-like multimodal foundation models under sub-2-bit extreme quantization—establishing a novel paradigm for efficient multimodal understanding at the edge.
📝 Abstract
Recent years have witnessed an increasing interest in image-text contrastive modeling, exemplified by models such as Contrastive Language-Image Pretraining (CLIP). In this paper, we propose TernaryCLIP, a lightweight computational framework that converts the connection weights of both the vision and text encoders of CLIP into a ternary format, instead of full-precision floating-point ones. TernaryCLIP incorporates quantization-aware training and distillation modules, preventing precision degradation and enabling low-cost and high-efficiency computation. Comprehensive experiments demonstrate that TernaryCLIP can achieve up to 99% ternarized weights with 1.58-bit representation, a 16.98$\times$ compression ratio, 2.3$\times$ inference acceleration, 16$\times$ storage reduction, 10$\times$ memory optimization, and 60% sparsity, while maintaining promising performance on zero-shot image classification and image-text retrieval tasks across 41 commonly used datasets. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices. The model and code can be accessed from Hugging Face and GitHub.
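To make the core idea concrete: ternarization maps each weight to one of three values {-α, 0, +α}, which costs log2(3) ≈ 1.58 bits per parameter and yields sparsity wherever the code is 0. The abstract does not specify the exact ternarization rule, so the sketch below assumes a standard threshold-based scheme (as in Ternary Weight Networks): weights below a magnitude threshold become 0, and the rest share one per-tensor scale. The function name `ternarize` and the 0.7 threshold factor are illustrative assumptions, not the paper's method.

```python
import numpy as np

def ternarize(w, delta_scale=0.7):
    """Map a float weight tensor to {-alpha, 0, +alpha}.

    Assumed threshold-based scheme (not necessarily TernaryCLIP's):
    weights with |w| <= delta are zeroed; the rest keep their sign
    and share a single per-tensor scale alpha.
    """
    delta = delta_scale * np.abs(w).mean()      # magnitude threshold
    mask = np.abs(w) > delta                    # positions kept nonzero
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    codes = np.where(mask, np.sign(w), 0.0)     # ternary codes {-1, 0, +1}
    return alpha * codes, codes, alpha

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
w_q, codes, alpha = ternarize(w)
sparsity = (codes == 0).mean()
print(f"alpha={alpha:.3f}, sparsity={sparsity:.1%}")
```

In a quantization-aware training setup, such a quantizer would sit in the forward pass while gradients flow to the underlying float weights (e.g. via a straight-through estimator); the zero codes are what give rise to the structural sparsity the abstract reports.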