The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses quantization-induced representation collapse (QIRC) in joint-embedding models such as CLIP under INT8 quantization, which degrades the cosine alignment of multimodal embeddings and impairs zero-shot retrieval performance. The study identifies and quantifies QIRC and proposes LRA-EE (Layer-wise Representation-Aware Early Exit), a method that replaces the [CLS] representation at shallow exits with Spatio-Semantic Aggregation (a global average of patch tokens) and introduces a learned multi-feature gate based on confidence, top-2 class margin, and spatial-activation variance. In addition, confidence thresholds are calibrated to each layer's information-to-noise ratio, so that samples can exit at shallow layers before deep-layer quantization noise accumulates. Evaluated on ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% compared to the INT8 baseline while improving Top-1 accuracy by 2.44 percentage points (from 58.72% to 61.16%), which the authors attribute to the proposed “rescue effect.”
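The exit decision described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the temperature value, the use of per-patch L2 norms as the "spatial-activation variance", and the logistic form of the gate are all assumptions made for illustration. It also assumes the patch tokens have already been projected into the joint image-text embedding space.

```python
import math

def mean_pool(patch_tokens):
    """Average a list of equal-length patch-token vectors into one vector."""
    n = len(patch_tokens)
    dim = len(patch_tokens[0])
    return [sum(tok[d] for tok in patch_tokens) / n for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_features(patch_tokens, text_embeddings, temperature=100.0):
    """Compute the three gate inputs named in the summary.

    Assumptions (not from the paper): `temperature` is a hypothetical logit
    scale, and spatial-activation variance is taken as the variance of
    per-patch L2 norms. Requires at least two classes for the margin.
    """
    pooled = l2_normalize(mean_pool(patch_tokens))
    logits = [
        temperature * sum(p * t for p, t in zip(pooled, l2_normalize(txt)))
        for txt in text_embeddings
    ]
    probs = sorted(softmax(logits), reverse=True)
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in patch_tokens]
    mean_norm = sum(norms) / len(norms)
    variance = sum((v - mean_norm) ** 2 for v in norms) / len(norms)
    return {
        "confidence": probs[0],            # top-1 probability
        "margin": probs[0] - probs[1],     # top-1 minus top-2 probability
        "spatial_variance": variance,      # spread of patch activations
    }

def should_exit(features, weights, bias, threshold):
    """Hypothetical logistic gate: exit early if the gate score clears the
    layer-specific threshold."""
    score = bias + sum(weights[name] * features[name] for name in weights)
    gate = 1.0 / (1.0 + math.exp(-score))
    return gate >= threshold
```

In this sketch the per-layer `threshold` stands in for the layer-adaptive calibration: a layer with a lower information-to-noise ratio would be given a stricter threshold, though the exact calibration rule is not given in the summary.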

📝 Abstract

Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from that of quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We characterize this as Quantization-Induced Representation Collapse (QIRC) and quantify it on INT8 CLIP ViT-B/32, where the layer-wise noise-to-signal ratio grows from below 10% in shallow blocks to 52% at Layer 11. We propose LRA-EE (Layer-wise Representation-Aware Early Exit), which bypasses noise-saturated deep layers via Spatio-Semantic Aggregation (replacing the immature shallow [CLS] with a global patch-token average), a learned multi-feature gate (confidence, top-2 margin, spatial-activation variance), and Layer-adaptive Confidence Thresholding calibrated to each layer's Information-to-Noise Ratio. On ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% and improves Top-1 accuracy by 2.44 percentage points (58.72% to 61.16%) over the INT8 baseline. A four-quadrant decomposition isolates the Rescue Effect: 9.5% of samples are correctly classified at shallow exits but lost to noise at full depth, against only 7.1% suffering the inverse.
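The two measurements in the abstract, the layer-wise noise-to-signal ratio and the four-quadrant decomposition, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's code: the abstract does not define the ratio, so it is assumed here to be the L2 norm of the quantization error divided by the L2 norm of the full-precision activation, and the quadrant names are hypothetical labels.

```python
import math

def noise_to_signal_ratio(fp_acts, q_acts):
    """Assumed definition: ||q - fp||_2 / ||fp||_2 over flattened activations
    of one layer, where fp is full precision and q is the INT8 counterpart."""
    noise = math.sqrt(sum((q - f) ** 2 for f, q in zip(fp_acts, q_acts)))
    signal = math.sqrt(sum(f * f for f in fp_acts))
    return noise / signal

def four_quadrants(early_correct, full_correct):
    """Split samples by whether the shallow exit and the full-depth model
    classify them correctly; returns the fraction of samples per quadrant."""
    counts = {"both": 0, "rescued": 0, "lost": 0, "neither": 0}
    for early, full in zip(early_correct, full_correct):
        if early and full:
            counts["both"] += 1
        elif early:
            counts["rescued"] += 1   # right at shallow exit, wrong at full depth
        elif full:
            counts["lost"] += 1      # wrong at shallow exit, right at full depth
        else:
            counts["neither"] += 1
    n = len(early_correct)
    return {name: c / n for name, c in counts.items()}

def net_rescue(quadrants):
    """Net gain from exiting early on every sample: rescued minus lost."""
    return quadrants["rescued"] - quadrants["lost"]
```

With the fractions reported in the abstract, rescued minus lost is 9.5% - 7.1% = 2.4 percentage points, which is roughly consistent with the reported Top-1 gain of 2.44 points. The match is only approximate because the gate decides per sample whether to exit early.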

Problem

Research questions and friction points this paper is trying to address.

Quantization-Induced Representation Collapse

Vision-Language Models

INT8 quantization

multimodal embedding

zero-shot retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantization-Induced Representation Collapse

Early Exit

Spatio-Semantic Aggregation