AI Summary
This work addresses quantization-induced representation collapse (QIRC) in joint-embedding models such as CLIP under INT8 quantization, which degrades cosine alignment of multimodal embeddings and impairs zero-shot retrieval performance. The study is the first to identify and quantify QIRC, proposing LRA-EE, a method that replaces [CLS] representations with spatial-semantic aggregation and introduces a multi-feature gating mechanism incorporating confidence, class margin, and spatial activation variance. Additionally, an adaptive early-exit strategy based on layer-wise information-to-noise ratio enables shallow-layer exits to prevent propagation of deep-layer quantization noise. Evaluated on ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% compared to the INT8 baseline while improving Top-1 accuracy by 2.44 percentage points (from 58.72% to 61.16%), demonstrating the efficacy of the proposed "rescue effect."
Abstract
Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from that of quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We characterize this as Quantization-Induced Representation Collapse (QIRC) and quantify it on INT8 CLIP ViT-B/32, where the layer-wise noise-to-signal ratio grows from below 10% in shallow blocks to 52% at Layer 11. We propose LRA-EE (Layer-wise Representation-Aware Early Exit), which bypasses noise-saturated deep layers via Spatio-Semantic Aggregation (replacing the immature shallow [CLS] with a global patch-token average), a learned multi-feature gate (confidence, top-2 margin, spatial-activation variance), and Layer-adaptive Confidence Thresholding calibrated to each layer's Information-to-Noise Ratio. On ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% and improves Top-1 accuracy by 2.44 percentage points (58.72% to 61.16%) over the INT8 baseline. A four-quadrant decomposition isolates the Rescue Effect: 9.5% of samples are correctly classified at shallow exits but lost to noise at full depth, against only 7.1% suffering the inverse.
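The quantities named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and the abstract does not give exact definitions, so several choices here are assumptions: the noise-to-signal ratio is taken as a relative L2 error against a full-precision reference, spatial-activation variance is taken as the variance of per-patch token norms, the learned gate is a plain logistic model, and the patch tokens are assumed to be already projected into the joint embedding space. All function names, the `temperature` value, and the gate parameters are hypothetical.

```python
import numpy as np


def noise_to_signal(fp_act, q_act):
    """Relative L2 error of quantized activations against a full-precision
    reference (one plausible definition; an assumption of this sketch)."""
    return float(np.linalg.norm(q_act - fp_act) / np.linalg.norm(fp_act))


def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit L2 norm."""
    return x / (np.linalg.norm(x) + eps)


def exit_features(patch_tokens, text_embeds, temperature=100.0):
    """Compute the three gate inputs at one candidate exit layer.

    patch_tokens: (N, D) patch tokens, assumed already in the joint space.
    text_embeds:  (C, D) L2-normalized class text embeddings.
    Returns (predicted class index, [confidence, top-2 margin, spatial variance]).
    """
    # Global patch-token average stands in for the shallow [CLS] token.
    pooled = l2_normalize(patch_tokens.mean(axis=0))
    logits = temperature * (text_embeds @ pooled)
    logits = logits - logits.max()  # shift for a numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    second, top = np.sort(probs)[-2:]
    # Assumed definition: variance of per-patch token norms.
    spatial_var = float(np.var(np.linalg.norm(patch_tokens, axis=1)))
    features = np.array([float(top), float(top - second), spatial_var])
    return int(np.argmax(probs)), features


def should_exit(features, weights, bias, layer_threshold):
    """Hypothetical logistic gate compared against a per-layer threshold."""
    score = 1.0 / (1.0 + np.exp(-(features @ weights + bias)))
    return bool(score >= layer_threshold)


def four_quadrant(early_correct, full_correct):
    """Fraction of samples in each correctness quadrant (early exit vs. full depth)."""
    early = np.asarray(early_correct, dtype=bool)
    full = np.asarray(full_correct, dtype=bool)
    return {
        "both_correct": float(np.mean(early & full)),
        "rescued": float(np.mean(early & ~full)),  # right early, wrong at full depth
        "lost": float(np.mean(~early & full)),     # wrong early, right at full depth
        "both_wrong": float(np.mean(~early & ~full)),
    }
```

In this decomposition, the net accuracy difference between the two exit points is `rescued - lost`. With the reported 9.5% and 7.1%, that is about 2.4 percentage points, in line with the reported 2.44-point gain, although the abstract does not state that the two figures are computed on identical terms. How the per-layer threshold is derived from the Information-to-Noise Ratio is not specified in the abstract, so `layer_threshold` is left as a plain input here.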