AI Summary
This work addresses the problem of automatically and efficiently allocating mixed precision under arbitrary bit budgets, with the goal of improving the trade-off between accuracy and resource consumption in large language models. The authors propose GAMMA, a quantizer-agnostic post-training framework that learns each module's precision preference through teacher-forced hidden-state reconstruction and uses integer programming to map these preferences into discrete bit allocations that satisfy the budget exactly. Its key property is "learn once, reuse across budgets": the preferences are learned in a single run while accounting for inter-module interactions, and a new target budget requires only re-solving the integer program rather than repeating the learning step, which significantly reduces deployment overhead. Experiments on Llama and Qwen models (8B to 32B) show that GAMMA outperforms both fixed-precision baselines (by up to 12.99 average points) and search-based mixed-precision methods (by up to 7.00 average points), and matches the accuracy of fixed 3-bit precision at an average of only 2.5 bits.
Abstract
Mixed-precision quantization improves the budget-accuracy trade-off for large language models (LLMs) by allocating more bits to sensitive modules. However, automating this allocation at LLM scale faces a unique combination of constraints: learnable approaches require quantization-aware training, which is infeasible for billion-parameter models; training-free alternatives rely on static proxy metrics that miss cross-module interactions and must be recomputed per target budget; and search-based methods are expensive without guaranteeing exact budget compliance. We propose GAMMA, a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline. GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming. A key property is score reuse: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets by re-solving only the integer program, reducing per-budget adaptation from hours to a few minutes. Across Llama and Qwen models (8B to 32B), GAMMA outperforms both fixed-precision baselines (up to +12.99 Avg.) and search-based mixed-precision methods (up to +7.00 Avg.), and can match fixed 3-bit quality at 2.5-bit average precision, enabling deployment at substantially smaller memory footprints.
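The projection step and the score-reuse property can be illustrated with a toy example. The sketch below is not GAMMA's implementation: it assumes each module already has a learned per-bit-width cost (the numbers, the module sizes, and the function name `allocate_bits` are all hypothetical), and it replaces the integer-programming solver with a small exact dynamic program over a multiple-choice knapsack, which solves the same kind of problem at toy scale. The budget is treated as an upper bound on the size-weighted average bit-width.

```python
from typing import Dict, List, Sequence, Tuple


def allocate_bits(
    sizes: Sequence[int],
    costs: Sequence[Dict[int, float]],
    avg_bit_budget: float,
) -> Tuple[float, List[int]]:
    """Choose one bit-width per module, minimizing total cost.

    Constraint: sum(size_i * bits_i) <= avg_bit_budget * sum(sizes).
    Hypothetical helper; an exact DP standing in for an ILP solver.
    """
    total_budget = avg_bit_budget * sum(sizes)
    # Maps "size-weighted bits used so far" -> (lowest cost, chosen bit-widths).
    states: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for size, module_costs in zip(sizes, costs):
        next_states: Dict[int, Tuple[float, List[int]]] = {}
        for used, (cost, chosen) in states.items():
            for bits, module_cost in module_costs.items():
                new_used = used + size * bits
                if new_used > total_budget:
                    continue  # this choice would exceed the budget
                candidate = (cost + module_cost, chosen + [bits])
                best = next_states.get(new_used)
                if best is None or candidate[0] < best[0]:
                    next_states[new_used] = candidate
        states = next_states
    if not states:
        raise ValueError("budget is infeasible for the given bit-widths")
    return min(states.values(), key=lambda state: state[0])


# Hypothetical relative module sizes and per-bit-width costs
# (a lower cost means less reconstruction error at that precision).
sizes = [4, 4, 2, 2]
costs = [
    {2: 9.0, 3: 3.0, 4: 1.0},  # sensitive module
    {2: 2.0, 3: 1.5, 4: 1.2},  # robust module
    {2: 5.0, 3: 1.0, 4: 0.5},  # sensitive module
    {2: 1.0, 3: 0.8, 4: 0.7},  # robust module
]

# The same costs serve both budgets; only the allocation is re-solved.
for budget in (2.5, 3.0):
    total_cost, bits = allocate_bits(sizes, costs, budget)
    print(budget, bits, total_cost)
```

In this toy setting the costs are computed once and each new budget only triggers another call to `allocate_bits`, which mirrors the "re-solve only the integer program" idea described above. At real model scale the state space of such a dynamic program could grow large, so a dedicated integer-programming solver would likely be the more practical choice.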