InfiJanice: Joint Analysis and In-situ Correction Engine for Quantization-Induced Math Degradation in Large Language Models

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Quantization of large language models (LLMs) severely degrades mathematical reasoning performance (accuracy drops of up to 69.81%), hindering practical deployment. Method: We propose the first automated, four-category error-attribution framework designed specifically for quantization-induced mathematical degradation. It leverages a compact, "silver-bullet" few-shot dataset (332 examples), enabling lightweight fine-tuning on a single GPU in 3–5 minutes. Our approach integrates quantitative analysis of models quantized with AWQ, GPTQ, and SmoothQuant; an automated error-classification pipeline; capability-aware data distillation; and efficient fine-tuning. Contribution/Results: On the GSM8K, MATH, and AIME benchmarks, quantized models fully recover full-precision reasoning accuracy. Inference latency decreases by 42% and GPU memory consumption drops by 73%, achieving both performance restoration and substantial efficiency gains.
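The headline degradation figure is a relative accuracy drop between a full-precision baseline and its quantized counterpart. A minimal sketch of that metric (the input accuracies below are hypothetical, chosen only to illustrate the calculation, not taken from the paper):

```python
def relative_drop(full_precision_acc: float, quantized_acc: float) -> float:
    """Relative accuracy drop (%) of a quantized model vs. its full-precision baseline."""
    return (full_precision_acc - quantized_acc) / full_precision_acc * 100

# Hypothetical benchmark accuracies, e.g. 53% before vs. 16% after quantization.
print(round(relative_drop(0.53, 0.16), 2))  # → 69.81
```

A ~70% relative drop can thus arise even when the absolute accuracy loss looks like "only" 37 percentage points, which is why relative reporting makes the degradation stand out.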

📝 Abstract
Large Language Models (LLMs) have demonstrated impressive performance on complex reasoning benchmarks such as GSM8K, MATH, and AIME. However, the substantial computational demands of these tasks pose significant challenges for real-world deployment. Model quantization has emerged as a promising approach to reduce memory footprint and inference latency by representing weights and activations with lower bit-widths. In this work, we conduct a comprehensive study of mainstream quantization methods (e.g., AWQ, GPTQ, SmoothQuant) on the most popular open-source models (e.g., Qwen2.5, LLaMA3 series), and reveal that quantization can degrade mathematical reasoning accuracy by up to 69.81%. To better understand this degradation, we develop an automated assignment and judgment pipeline that qualitatively categorizes failures into four error types and quantitatively identifies the most impacted reasoning capabilities. Building on these findings, we employ an automated data-curation pipeline to construct a compact "Silver Bullet" dataset. Training a quantized model on as few as 332 carefully selected examples for just 3–5 minutes on a single GPU is enough to restore its reasoning accuracy to match that of the full-precision baseline.
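The curation step implied by the abstract (classify failures, then distill a small, capability-aware training set) can be sketched as follows. This is a hypothetical illustration, not the paper's algorithm: the `category` field, the proportional-sampling rule, and the function name are all assumptions; the paper does not specify how its 332 examples were chosen.

```python
import random
from collections import Counter

def select_silver_bullet(failures, budget=332, seed=0):
    """Hypothetical capability-aware distillation sketch: sample failing examples
    in proportion to how often each error category occurs, up to `budget` items."""
    rng = random.Random(seed)
    by_cat = {}
    for ex in failures:
        by_cat.setdefault(ex["category"], []).append(ex)
    counts = Counter(ex["category"] for ex in failures)
    total = sum(counts.values())
    selected = []
    for cat, pool in by_cat.items():
        # Allocate the budget proportionally to each category's failure frequency.
        k = min(len(pool), round(budget * counts[cat] / total))
        selected.extend(rng.sample(pool, k))
    return selected[:budget]
```

Proportional allocation is one plausible choice here; a pipeline that instead over-samples the most degraded capability (as the paper's capability analysis suggests) would only change how `k` is computed.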
Problem

Research questions and friction points this paper is trying to address.

Quantization degrades math reasoning in LLMs by up to 69.81%
Identifies four error types impacting reasoning capabilities
Restores accuracy with minimal training on curated data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline categorizes quantization errors
Constructs compact Silver Bullet dataset
Quick training restores full-precision accuracy