🤖 AI Summary
Uniform quantization strategies (e.g., universal 4-bit) fail to optimize memory usage for large language model inference, because the KV cache, not the model weights, often dominates memory consumption. Method: Through systematic evaluation of 1,700 inference configurations on AIME25 and GPQA-Diamond, we identify an 8-bit/4B-parameter threshold: smaller models gain accuracy from higher weight precision, whereas larger models benefit more from expanded KV cache capacity that supports longer generations. Based on this insight, we propose the first model-scale-aware memory allocation principle that jointly optimizes weight quantization precision, KV cache compression, and generation length. Contribution/Results: Our adaptive strategy significantly outperforms universal 4-bit quantization, achieving higher inference accuracy and deployment efficiency at a comparable memory footprint, with consistent improvements across diverse model scales and benchmark tasks.
📝 Abstract
While 4-bit quantization has emerged as the memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where the KV cache rather than the model weights can dominate memory. Through systematic experiments across 1,700 inference scenarios on AIME25 and GPQA-Diamond, we find a scale-dependent trade-off: models whose effective size falls below that of a 4B-parameter model at 8-bit precision achieve better accuracy by allocating memory to more weights rather than to longer generations, while larger models achieve better accuracy by allocating memory to longer generations. This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV cache quantization. Our findings show that memory optimization for LLMs cannot be scale-agnostic, and they yield principled guidelines: for small reasoning models, prioritize model capacity over test-time compute; for larger ones, maximize test-time compute. Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies from those established for non-reasoning models.
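To see why the KV cache can dominate memory for reasoning models, a back-of-envelope accounting helps. The sketch below is illustrative only: the layer count, KV-head count, and head dimension are assumed values for a hypothetical 4B-parameter model, not figures from the paper.

```python
# Back-of-envelope memory accounting for LLM inference.
# Model shape below (36 layers, 8 KV heads, head_dim 128) is an
# assumption for a hypothetical 4B-parameter model, not from the paper.

def weight_bytes(n_params: float, bits: int) -> float:
    """Memory for quantized weights, ignoring quantization metadata."""
    return n_params * bits / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits: int, batch: int = 1) -> float:
    """KV cache: two tensors (K and V) per layer, per token, per head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits / 8 * batch


w16 = weight_bytes(4e9, 16)                    # fp16 weights
w4 = weight_bytes(4e9, 4)                      # 4-bit weights
kv = kv_cache_bytes(36, 8, 128, 32_768, 16)    # 32k-token reasoning trace, fp16

print(f"weights fp16 : {w16 / 2**30:.1f} GiB")
print(f"weights 4-bit: {w4 / 2**30:.1f} GiB")
print(f"KV cache @32k: {kv / 2**30:.1f} GiB")
```

Under these assumed shapes, a single 32k-token generation's KV cache already exceeds the 4-bit weight footprint, which is why weight quantization alone cannot be the whole memory story once generations get long.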