Does quantization affect models' performance on long-context tasks?

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Quantization's impact on large language models (LLMs) in ultra-long-context tasks (>64K tokens) remains poorly understood, particularly across multilingual settings and diverse model architectures. Method: The authors systematically evaluate mainstream quantization methods (FP8, GPTQ-int8, GPTQ-int4, AWQ-int4, and BNB-nf4) on Llama-3.1 (8B/70B) and Qwen-2.5 (7B/32B/72B) models, benchmarking long-text understanding and generation in English, Chinese, and other languages. Contribution/Results: They show that quantization performance depends critically on the quantization scheme, model architecture, and task type: 8-bit methods incur only a ~0.8% average accuracy drop, whereas 4-bit methods degrade performance by up to 59% on non-English long inputs, challenging the assumption that quantized models can be deployed universally. Notably, Qwen-2.5-72B remains robust under BNB-nf4, while Llama-3.1-70B suffers a 32% decline under the identical configuration. The study establishes the first empirical benchmark and design guidelines for quantization in ultra-long-context scenarios.

📝 Abstract
Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long inputs (>64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially for tasks involving long context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of a careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and with languages other than English.
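The intuition behind the 8-bit vs. 4-bit gap can be seen with a toy round-trip experiment. The sketch below is not the paper's evaluation pipeline; it is a minimal illustration, assuming simple symmetric per-tensor round-to-nearest quantization (the hypothetical helper `quantize_dequantize` is for illustration only, and real methods like GPTQ, AWQ, and NF4 are considerably more sophisticated):

```python
import random

def quantize_dequantize(weights, bits):
    # Symmetric round-to-nearest: map floats onto a signed integer grid
    # with `bits` bits of precision, then back to floats.
    qmax = 2 ** (bits - 1) - 1          # 127 for int8, 7 for int4
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

random.seed(0)
# Toy "weight tensor": 10,000 Gaussian values, roughly LLM-weight-scaled.
weights = [random.gauss(0, 0.02) for _ in range(10_000)]

for bits in (8, 4):
    deq = quantize_dequantize(weights, bits)
    err = sum(abs(a - b) for a, b in zip(weights, deq)) / len(weights)
    print(f"int{bits}: mean abs round-trip error = {err:.6f}")
```

Each bit removed halves the integer grid's resolution, so the 4-bit round-trip error is roughly an order of magnitude larger than the 8-bit one; the paper's finding is that long-context, non-English tasks are where that extra weight noise most visibly compounds into accuracy loss.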
Problem

Research questions and friction points this paper is trying to address.

Evaluates impact of quantization on LLMs' long-context task performance
Compares 5 quantization methods across 5 models on 9.7K test examples
Identifies significant accuracy drops in 4-bit methods for non-English inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

First systematic evaluation of quantized LLMs on long-input (>64K tokens) and long-form generation tasks
Covers 8-bit (FP8, GPTQ-int8) and 4-bit (GPTQ-int4, AWQ-int4, BNB-nf4) quantization methods
Shows degradation is method-, model-, and task-specific (e.g., Qwen-2.5 72B stays robust under BNB-nf4 while Llama-3.1 70B drops 32%)