Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high deployment cost of vision-language models (VLMs) and the significant accuracy degradation commonly caused by existing post-training quantization methods, while quantization-aware training (QAT) remains underexplored for VLMs. The authors propose GRACE, a unified framework that integrates QAT with knowledge distillation under the information bottleneck principle. GRACE employs a confidence-gated mechanism to filter unreliable supervision signals, preserves visual token structure via relational centered kernel alignment, and dynamically balances fidelity against information-capacity constraints with an adaptive controller based on Lagrangian relaxation. Experiments on LLaVA and Qwen series models show that the INT4-quantized variants surpass their FP16 baselines (e.g., LLaVA-1.5-7B reaches 70.1 vs. 66.8 on SQA), nearly matching teacher performance while delivering 3× higher inference throughput and a 54% reduction in memory usage with real INT4 kernels.
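The confidence-gated distillation component can be read as masking the distillation loss wherever the teacher itself is unsure. The following PyTorch snippet is a minimal sketch of that reading, not the paper's implementation: the threshold `tau`, the temperature `T`, and the choice of gating on the teacher's top-1 probability are assumptions, and the paper's decoupled loss decomposition is omitted.

```python
import torch
import torch.nn.functional as F

def confidence_gated_kd_loss(student_logits, teacher_logits, tau=0.7, T=2.0):
    """Token-level KD loss that keeps only positions where the teacher is confident.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    tau: assumed confidence threshold on the teacher's top-1 probability
    T:   assumed distillation temperature
    """
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_logits / T, dim=-1)
        # Gate: 1 where the teacher's top-1 probability exceeds tau, else 0.
        gate = (teacher_probs.max(dim=-1).values >= tau).float()  # (batch, seq_len)

    log_student = F.log_softmax(student_logits / T, dim=-1)
    # Per-token KL(teacher || student), summed over the vocabulary.
    kl = (teacher_probs * (teacher_probs.clamp_min(1e-8).log() - log_student)).sum(-1)

    # Average only over gated (reliable) tokens; scale by T^2 as usual for KD.
    return (kl * gate).sum() / gate.sum().clamp_min(1.0) * (T ** 2)
```

In a QAT loop this term would be added to the task loss of the INT4 student while the FP16 teacher runs frozen in evaluation mode.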

📝 Abstract
Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernels, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.
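The relational centered kernel alignment term transfers the similarity structure among visual tokens rather than the raw activations. Below is a small sketch using plain linear CKA between student and teacher visual token features as one plausible instantiation; the function names and the 1 - CKA loss form are assumptions, and the paper's exact kernel and relational formulation may differ.

```python
import torch

def linear_cka(X, Y, eps=1e-8):
    """Linear centered kernel alignment between two token-feature matrices.

    X: (n_tokens, d_student) visual tokens from the quantized student
    Y: (n_tokens, d_teacher) visual tokens from the FP16 teacher
    Returns a scalar in [0, 1]; higher means more similar relational structure.
    """
    # Center each feature dimension across tokens.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)

    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = (Y.t() @ X).pow(2).sum()
    norm_x = (X.t() @ X).norm(p="fro")
    norm_y = (Y.t() @ Y).norm(p="fro")
    return cross / (norm_x * norm_y + eps)

def relational_alignment_loss(student_tokens, teacher_tokens):
    """1 - CKA, so minimizing it pulls the token relation structures together."""
    return 1.0 - linear_cka(student_tokens, teacher_tokens)
```

Because CKA compares Gram-matrix structure, it tolerates the dimensionality mismatch between student and teacher features without a learned projection, which is one reason relational losses are a natural fit for distilling visual tokens.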
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
quantization
accuracy loss
quantization-aware training
efficient deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantization-Aware Training
Confidence-Gated Distillation
Relational Alignment
Information Bottleneck
Vision-Language Models
Yanlong Chen
Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
A. Habibian
Qualcomm AI Research, Amsterdam, the Netherlands
Luca Benini
ETH Zürich, Università di Bologna
Integrated Circuits, Computer Architecture, Embedded Systems, VLSI, Machine Learning
Yawei Li
ETH Zurich
Computer Vision, Model Acceleration, TinyML, Biosignal Processing