🤖 AI Summary
To address the severe downstream performance degradation caused by low-bit (2–3 bit) post-training quantization (PTQ), this paper proposes Task-Circuit Quantization (TaCQ). TaCQ treats model weights as interpretable, task-oriented "circuits," dynamically identifying and preserving a critical weight subset (the "task circuit") in 16-bit precision while quantizing the remaining weights to low bit-widths. This identification combines an estimate of the weight change induced by quantization, gradient-based attribution of task performance to individual weights, and a comparison of weight importance across the model. Unlike conventional mixed-precision PTQ methods that rely on hand-crafted heuristics or global statistics, TaCQ identifies critical weights in a task-driven, localized, and adaptive way. Evaluated on Llama-3-8B-Instruct and Qwen2.5, TaCQ recovers 96% of the original MMLU accuracy at 3.1 bits, outperforming SPQR by 5.25% absolute; at 2 bits it surpasses SliM-LLM by 14.74% on average, and even without task-specific calibration it achieves a 7.20% gain.
📝 Abstract
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full-precision weights to low-bit weights without costly retraining, but it can degrade downstream performance, especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantization process on specific weight circuits -- which we define as sets of weights associated with downstream task performance. These weights are kept in 16-bit precision while the others are quantized, maintaining performance at only a marginal memory cost. Specifically, TaCQ contrasts unquantized model weights with a uniformly quantized model to estimate the expected change in weights due to quantization, and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TaCQ to existing mixed-precision quantization methods when conditioning both on general-purpose and on task-specific data. Across QA, math-reasoning, and text-to-SQL tasks for both Llama-3 and Qwen2.5, we find that TaCQ outperforms baselines that use the same calibration data and a lower weight budget, achieving major improvements in the 2- and 3-bit regime. With only 3.1 bits, we recover 96% of Llama-3-8B-Instruct's unquantized 16-bit MMLU performance, a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, SliM-LLM. Moreover, we observe a 7.20% gain without conditioning on specific tasks, showing that TaCQ's ability to identify important weights is not limited to task-conditioned settings.
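The scoring idea in the abstract — contrast each weight with its uniformly quantized counterpart and weight that difference by the task gradient to predict the loss impact — can be sketched with a first-order saliency rule. This is a minimal NumPy illustration, not the paper's implementation: the function names, the crude rounding "quantizer," and the 1% keep fraction are all assumptions introduced here for the example.

```python
import numpy as np

def tacq_style_saliency(w, w_quant, grad):
    """First-order estimate of the loss change from quantizing each weight:
    saliency_i ~ |grad_i * (w_quant_i - w_i)|. (Illustrative, not the paper's
    exact criterion.)"""
    return np.abs(grad * (w_quant - w))

def mixed_precision_mask(w, w_quant, grad, keep_frac=0.01):
    """Return a boolean mask of weights to keep in 16-bit: the top keep_frac
    fraction by saliency; everything else would be quantized."""
    s = tacq_style_saliency(w, w_quant, grad)
    k = max(1, int(keep_frac * s.size))
    threshold = np.partition(s.ravel(), -k)[-k]  # k-th largest saliency
    return s >= threshold

# Toy demo: random "weights" and "task gradients", with coarse rounding
# standing in for a uniform low-bit quantizer.
rng = np.random.default_rng(0)
w = rng.normal(size=1000)
grad = rng.normal(size=1000)
w_quant = np.round(w * 2) / 2  # stand-in uniform quantizer (step 0.5)
mask = mixed_precision_mask(w, w_quant, grad, keep_frac=0.01)
```

Weights where `mask` is `True` would stay in 16-bit; the rest are mapped to their low-bit values, so the extra memory cost scales with `keep_frac`.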