🤖 AI Summary
Quantization of large language models (LLMs) can severely degrade mathematical reasoning performance, with accuracy drops of up to 69.81%, hindering practical deployment.
Method: We propose the first automated, four-category error attribution framework specifically designed for quantization-induced mathematical degradation. It leverages a compact "Silver Bullet" few-shot dataset (332 examples) that enables lightweight fine-tuning on a single GPU in 3–5 minutes. Our approach integrates quantitative analysis of models quantized with AWQ, GPTQ, and SmoothQuant; an automated error-classification pipeline; capability-aware data distillation; and efficient fine-tuning.
Contribution/Results: On the GSM8K, MATH, and AIME benchmarks, quantized models fully recover full-precision reasoning accuracy, while inference latency decreases by 42% and GPU memory consumption drops by 73%, achieving both performance restoration and substantial efficiency gains.
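The degradation above stems from the precision loss of low-bit weight representations. As a minimal illustration (not the AWQ, GPTQ, or SmoothQuant algorithms themselves, which add calibration-aware scaling and error compensation on top of this basic scheme), here is symmetric per-tensor round-to-nearest int8 quantization:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor round-to-nearest int8 quantization.

    Maps the largest-magnitude weight to +/-127; every other weight
    incurs a rounding error of at most scale / 2.
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

# Typical LLM weight matrices are roughly zero-centered with small std.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()  # bounded by s / 2
```

Accumulated across dozens of transformer layers, these per-weight rounding errors are what perturb long multi-step reasoning chains far more than short-form tasks.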
📝 Abstract
Large Language Models (LLMs) have demonstrated impressive performance on complex reasoning benchmarks such as GSM8K, MATH, and AIME. However, the substantial computational demands of these tasks pose significant challenges for real-world deployment. Model quantization has emerged as a promising approach to reduce memory footprint and inference latency by representing weights and activations with lower bit-widths. In this work, we conduct a comprehensive study of mainstream quantization methods (e.g., AWQ, GPTQ, SmoothQuant) on the most popular open-source models (e.g., Qwen2.5, LLaMA3 series), and reveal that quantization can degrade mathematical reasoning accuracy by up to 69.81%. To better understand this degradation, we develop an automated assignment and judgment pipeline that qualitatively categorizes failures into four error types and quantitatively identifies the most impacted reasoning capabilities. Building on these findings, we employ an automated data-curation pipeline to construct a compact "Silver Bullet" dataset. Training a quantized model on as few as 332 carefully selected examples for just 3–5 minutes on a single GPU is enough to restore its reasoning accuracy to match that of the full-precision baseline.
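The automated assignment-and-judgment pipeline can be sketched as an LLM-as-judge classifier over failure cases. Note that the four category names below are hypothetical placeholders for illustration; the paper's actual taxonomy is not spelled out in this abstract and may differ:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical placeholder categories -- illustrative only, not the
# paper's actual four-way error taxonomy.
CATEGORIES = [
    "calculation_error",
    "logic_error",
    "instruction_deviation",
    "formatting_error",
]

@dataclass
class Failure:
    """One benchmark item the quantized model got wrong."""
    question: str
    quantized_answer: str
    reference_answer: str

def classify(failure: Failure, judge: Callable[[str], str]) -> str:
    """Ask a judge model to assign exactly one error category.

    `judge` is any callable mapping a prompt string to a completion,
    e.g. a wrapper around a full-precision LLM.
    """
    prompt = (
        "Compare the model answer against the reference and label the "
        f"failure with exactly one of: {', '.join(CATEGORIES)}.\n"
        f"Question: {failure.question}\n"
        f"Model answer: {failure.quantized_answer}\n"
        f"Reference: {failure.reference_answer}\n"
        "Label:"
    )
    label = judge(prompt).strip()
    # Fall back to a default bucket if the judge replies off-taxonomy.
    return label if label in CATEGORIES else "logic_error"
```

Aggregating these labels over a benchmark's failure set yields the per-category counts that drive the capability-aware data distillation step.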