🤖 AI Summary
To address the high computational cost and limited deployability of large language models (LLMs) for programming tasks, this paper proposes a lightweight, domain-aware pruning method tailored to code generation. Leveraging domain-calibrated datasets spanning Python, Java, C++, and JavaScript, it introduces the first programming-language-specific adaptation of Wanda unstructured pruning. The core contributions are: (1) empirical evidence that different coding tasks activate distinct, task-specific neural subregions, enabling fine-grained, specialization-aware pruning; and (2) the first application of domain-calibrated, data-driven sub-model extraction to multilingual code generation. Experiments show that, with up to a 60% reduction in parameter count and GPU memory footprint, the pruned models retain near-full-model accuracy on standard code generation benchmarks. This enables real-time inference on consumer-grade GPUs and supports end-to-end local development workflows.
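For intuition, below is a minimal sketch of the Wanda pruning criterion applied to a single linear layer, using activations gathered from a domain-specific calibration set. The function name, tensor shapes, and 60% sparsity default are illustrative assumptions, not the paper's exact configuration; in practice the procedure is applied layer by layer across the model.

```python
import torch

def wanda_prune_layer(weight: torch.Tensor,
                      calib_inputs: torch.Tensor,
                      sparsity: float = 0.6) -> torch.Tensor:
    """Zero out the lowest-scoring weights in each output row (Wanda).

    weight:       (out_features, in_features) linear-layer weight
    calib_inputs: (num_tokens, in_features) activations collected by running
                  domain-specific calibration data (e.g. Python snippets)
                  through the model up to this layer
    sparsity:     fraction of weights to remove per output row (assumed 0.6)
    """
    # Wanda score: S_ij = |W_ij| * ||X_j||_2, where X_j is the j-th input
    # feature column over all calibration tokens.
    feature_norms = calib_inputs.norm(p=2, dim=0)        # (in_features,)
    scores = weight.abs() * feature_norms.unsqueeze(0)   # broadcast per row

    # Per-output-row pruning: mask the k lowest-scoring weights in each row.
    k = int(weight.shape[1] * sparsity)
    _, prune_idx = torch.topk(scores, k, dim=1, largest=False)
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return weight * mask
```

Because the feature norms come from the calibration data, swapping in a different calibration corpus (say, Java instead of Python) changes which weights survive, which is what makes the extracted sub-models language-specific.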
📝 Abstract
Large Language Models (LLMs) have demonstrated exceptional performance on a variety of complex code generation tasks. However, their broader adoption is limited by significant computational demands and high resource requirements, particularly for memory and processing power. To mitigate these requirements, model pruning techniques are used to create more compact models with significantly fewer parameters. However, current approaches do not focus on the efficient extraction of programming-language-specific sub-models. In this work, we explore the idea of efficiently deriving coding-specific sub-models through unstructured pruning (specifically, Wanda). We investigate the impact of different domain-specific calibration datasets on pruning outcomes across three distinct domains and extend our analysis to extracting four language-specific sub-models: Python, Java, C++, and JavaScript. We are the first to efficiently extract programming-language-specific sub-models using appropriate calibration datasets while maintaining acceptable accuracy relative to the full models. We are also the first to provide analytical evidence that domain-specific tasks activate distinct regions within LLMs, supporting the creation of specialized sub-models through unstructured pruning. We believe this work has significant potential to enhance LLM accessibility for coding by reducing computational requirements, enabling local execution on consumer-grade hardware and faster inference times critical for real-time development feedback.
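To make the calibration setup concrete, here is a hypothetical sketch of assembling a language-specific calibration set from a public code corpus. The dataset (`bigcode/the-stack-smol`, which may require accepting its terms on the Hugging Face Hub), the field names, and the sample counts are assumptions for illustration, not the paper's actual data pipeline.

```python
from datasets import load_dataset

def build_calibration_set(language: str,
                          num_samples: int = 128,
                          max_chars: int = 8192) -> list[str]:
    """Collect code snippets in a single language for pruning calibration."""
    # the-stack-smol exposes per-language subsets under data/<language>;
    # any comparable single-language code corpus would work the same way.
    ds = load_dataset("bigcode/the-stack-smol",
                      data_dir=f"data/{language}",
                      split="train",
                      streaming=True)
    samples = []
    for row in ds:
        if row["content"]:
            samples.append(row["content"][:max_chars])  # rough length cap
        if len(samples) == num_samples:
            break
    return samples

# Example: calibration snippets for a Python-specific sub-model.
python_calib = build_calibration_set("python")
```

Under this setup, extracting a Java or C++ sub-model only requires changing the `language` argument; the pruning procedure itself stays identical.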