Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high deployment cost of vision-language models (VLMs) and the significant accuracy degradation commonly caused by existing post-training quantization methods, while quantization-aware training (QAT) remains underexplored for VLMs. The authors propose GRACE, a unified framework that integrates QAT with knowledge distillation under the information bottleneck principle. GRACE employs a confidence-gated mechanism to filter unreliable supervision signals, preserves visual token structure via relational centered kernel alignment, and dynamically balances fidelity against information-capacity constraints with an adaptive controller based on Lagrangian relaxation. Experiments on LLaVA and Qwen series models show that the INT4-quantized variants surpass their FP16 baselines (e.g., LLaVA-1.5-7B reaches 70.1 vs. 66.8 on SQA), nearly matching teacher performance while delivering 3× higher inference throughput and a 54% reduction in memory usage with real INT4 kernels.
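The confidence-gated distillation component can be read as masking the distillation loss wherever the teacher itself is unsure. The following PyTorch snippet is a minimal sketch of that reading, not the paper's implementation: the threshold `tau`, the temperature `T`, and the choice of gating on the teacher's top-1 probability are assumptions, and the paper's decoupled loss decomposition is omitted.

```python
import torch
import torch.nn.functional as F

def confidence_gated_kd_loss(student_logits, teacher_logits, tau=0.7, T=2.0):
    """Token-level KD loss that keeps only positions where the teacher is confident.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    tau: assumed confidence threshold on the teacher's top-1 probability
    T:   assumed distillation temperature
    """
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_logits / T, dim=-1)
        # Gate: 1 where the teacher's top-1 probability exceeds tau, else 0.
        gate = (teacher_probs.max(dim=-1).values >= tau).float()  # (batch, seq_len)

    log_student = F.log_softmax(student_logits / T, dim=-1)
    # Per-token KL(teacher || student), summed over the vocabulary.
    kl = (teacher_probs * (teacher_probs.clamp_min(1e-8).log() - log_student)).sum(-1)

    # Average only over gated (reliable) tokens; scale by T^2 as usual for KD.
    return (kl * gate).sum() / gate.sum().clamp_min(1.0) * (T ** 2)
```

In a QAT loop this term would be added to the task loss of the INT4 student while the FP16 teacher runs frozen in evaluation mode.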

📝 Abstract
Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernels, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.
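The relational centered kernel alignment term transfers the similarity structure among visual tokens rather than the raw activations. Below is a small sketch using plain linear CKA between student and teacher visual token features as one plausible instantiation; the function names and the 1 - CKA loss form are assumptions, and the paper's exact kernel and relational formulation may differ.

```python
import torch

def linear_cka(X, Y, eps=1e-8):
    """Linear centered kernel alignment between two token-feature matrices.

    X: (n_tokens, d_student) visual tokens from the quantized student
    Y: (n_tokens, d_teacher) visual tokens from the FP16 teacher
    Returns a scalar in [0, 1]; higher means more similar relational structure.
    """
    # Center each feature dimension across tokens.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)

    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = (Y.t() @ X).pow(2).sum()
    norm_x = (X.t() @ X).norm(p="fro")
    norm_y = (Y.t() @ Y).norm(p="fro")
    return cross / (norm_x * norm_y + eps)

def relational_alignment_loss(student_tokens, teacher_tokens):
    """1 - CKA, so minimizing it pulls the token relation structures together."""
    return 1.0 - linear_cka(student_tokens, teacher_tokens)
```

Because CKA compares Gram-matrix structure, it tolerates the dimensionality mismatch between student and teacher features without a learned projection, which is one reason relational losses are a natural fit for distilling visual tokens.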
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
quantization
accuracy loss
quantization-aware training
efficient deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantization-Aware Training
Confidence-Gated Distillation
Relational Alignment
Information Bottleneck
Vision-Language Models
Yanlong Chen
Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
A. Habibian
Qualcomm AI Research, Amsterdam, the Netherlands
Luca Benini
ETH Zürich, Università di Bologna
Integrated Circuits, Computer Architecture, Embedded Systems, VLSI, Machine Learning
Yawei Li
ETH Zurich
Computer Vision, Model Acceleration, TinyML, Biosignal Processing