Quantizing Large Language Models for Code Generation: A Differentiated Replication

📅 2025-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive memory consumption and carbon footprint of deploying large language models (LLMs) for code generation, this work systematically investigates ultra-low-bit quantization of code-specialized LLMs of up to 34B parameters. It extends prior 8-bit results by applying recent quantization methods (AWQ, GPTQ) down to 2 bits per parameter and by evaluating different calibration datasets, including code-specific ones, in a reproducible, differentiated replication. Results show that 4-bit quantization achieves roughly 70% memory reduction with no statistically significant degradation on the HumanEval and MBPP benchmarks, establishing 4 bits as the current Pareto-optimal trade-off. Moreover, under extreme 2- and 3-bit quantization, code-specific calibration recovers over 40% of the performance loss. The work thus provides a green deployment pathway for efficient code-generation models, along with empirical baselines for low-bit quantization of code LLMs.

📝 Abstract
Large Language Models (LLMs) have shown impressive capability in code generation and, specifically, in automatically implementing requirements described in natural language. An LLM's effectiveness generally increases with its size: the higher the number of trainable parameters, the better its ability to implement code. However, when it comes to deploying LLM-based code generators, larger LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. A previous work by Wei et al. proposed to leverage quantization techniques to reduce the memory footprint of LLM-based code generators without substantially degrading their effectiveness. In short, they studied LLMs featuring up to 16B parameters, quantizing their weights from 32-bit floating point down to 8-bit integers and showing the limited impact on code generation performance. Given the fast pace at which LLM capabilities and quantization techniques are evolving, in this work we present a differentiated replication of the work by Wei et al. in which we consider (i) more recent and larger code-related LLMs, of up to 34B parameters; (ii) the latest advancements in model quantization techniques, which allow pushing compression to the extreme level of 2 bits per model parameter; and (iii) different types of calibration datasets to guide the quantization process, including code-specific ones. Our empirical evaluation reveals that the new frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70% compared to the original model without any significant decrease in performance. Additionally, when the quantization becomes even more extreme (3 and 2 bits), a code-specific calibration dataset helps to limit the loss of performance.
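The ~70% figure reported above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below computes raw weight storage for a 34B-parameter model at several bit widths, assuming a 16-bit baseline; it deliberately ignores the scales and zero-points that real quantized checkpoints also store, which is why measured savings land a few points below the raw bit-width ratio.

```python
# Theoretical weight storage for a 34B-parameter model at various
# precisions. Overheads (quantization scales, zero-points, activations)
# are ignored, so these are upper bounds on the achievable savings.

def weight_footprint_gb(n_params: float, bits_per_param: int) -> float:
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N = 34e9  # parameters, matching the largest model studied in the paper
for bits in (32, 16, 8, 4, 3, 2):
    gb = weight_footprint_gb(N, bits)
    reduction = 1 - bits / 16  # relative to an assumed 16-bit baseline
    print(f"{bits:2d}-bit: {gb:6.1f} GB ({reduction:.0%} smaller than fp16)")
```

At 4 bits the raw ratio gives 75% savings over fp16; the paper's measured average of 70% is consistent once per-group scales and other metadata are accounted for.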
Problem

Research questions and friction points this paper is trying to address.

Reducing memory footprint of large language models for code generation.
Exploring extreme quantization techniques up to 2 bits per parameter.
Evaluating performance impact of quantization on code generation tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantizes LLMs to 4-bit precision
Uses code-specific calibration datasets
Reduces memory footprint by 70%
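To make the idea behind these contributions concrete, here is a minimal, self-contained sketch of symmetric round-to-nearest weight quantization, the naive baseline that methods like GPTQ and AWQ improve upon. The weight values are illustrative; real methods additionally use a calibration dataset to pick scales (AWQ) or to compensate rounding error layer by layer (GPTQ), which is where the paper's code-specific calibration comes in.

```python
# Toy symmetric quantization: scale from the absolute-max weight,
# then round each weight to a signed integer grid. No calibration
# data is used here, unlike AWQ/GPTQ.

def quantize_symmetric(weights, bits):
    """Map floats to signed integers in [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax   # absmax scaling
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.12, -0.53, 0.40, 0.07, -0.21]
q4, s4 = quantize_symmetric(weights, bits=4)
recon = dequantize(q4, s4)
err = max(abs(w - r) for w, r in zip(weights, recon))
print(q4, f"max abs error = {err:.3f}")
```

With only 15 grid points at 4 bits, the rounding error stays below half a scale step; at 2-3 bits the grid becomes so coarse that how the scale is chosen dominates, which is consistent with the paper's finding that code-specific calibration matters most at extreme quantization levels.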