Automated Customization of LLMs for Enterprise Code Repositories Using Semantic Scopes

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited effectiveness of general-purpose large language models in code completion tasks within private enterprise codebases, where domain-specific structures and coding styles hinder performance. To overcome this challenge, the authors propose a semantic-scope-based approach for automatically constructing training data, which integrates retrieval-augmented generation (RAG) with supervised fine-tuning to efficiently customize medium-scale language models. Experimental results on two real-world enterprise codebases demonstrate that the resulting customized models significantly outperform larger, unadapted general-purpose models in code completion accuracy, while maintaining strong generalization capabilities on public benchmarks. These findings validate both the efficacy and practicality of the proposed method for adapting language models to proprietary software environments.

📝 Abstract
Code completion (CC) is a task frequently performed by developers working with LLM-based programming assistants. Despite the improved performance of LLMs on public benchmarks, out-of-the-box LLMs still struggle to generate code that aligns with a private code repository not seen during the model's training. Customizing code LLMs to a private repository provides a way to improve model performance. In this paper we present our approach for automated LLM customization based on semantic scopes in the code. We evaluate LLMs on real industry cases with two private enterprise code repositories and two customization strategies: Retrieval-Augmented Generation (RAG) and supervised Fine-Tuning (FT). Our mechanism for ingesting the repository's data and formulating training data pairs with semantic scopes helps models learn the patterns specific to the repository, providing more precise code to developers and helping to boost their productivity. The code completions of moderately sized customized models can be significantly better than those of uncustomized models of much larger capacity. We also include an analysis of customization on two public benchmarks and present opportunities for future work.
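The paper does not spell out its data-construction pipeline here, but the idea of forming training pairs from semantic scopes can be illustrated with a minimal sketch: parse each source file, treat every function definition as a scope, and pair the preceding context with the scope's body as the completion target. The function name `scope_training_pairs` and the pairing scheme below are illustrative assumptions, not the authors' actual implementation.

```python
import ast
import textwrap

def scope_training_pairs(source: str):
    """Split a Python source file into (context, completion) pairs,
    one per function-level semantic scope.

    Hypothetical sketch: the paper's real scope extraction and
    pair-formulation scheme is not specified in this summary.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Context: everything up to and including the signature line.
            context = "\n".join(lines[: node.lineno])
            # Completion target: the body of this scope.
            body_start = node.body[0].lineno - 1
            completion = "\n".join(lines[body_start : node.end_lineno])
            pairs.append((context, completion))
    return pairs

sample = textwrap.dedent("""\
    import math

    def area(r):
        return math.pi * r * r
""")

for ctx, comp in scope_training_pairs(sample):
    print("CONTEXT ends with:", ctx.splitlines()[-1])
    print("COMPLETION:", comp.strip())
```

Pairs like these can feed supervised fine-tuning directly, or the scope texts can be embedded and indexed for retrieval-augmented generation, matching the two customization strategies the paper evaluates.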
Problem

Research questions and friction points this paper is trying to address.

code completion
large language models
private code repositories
model customization
semantic scopes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Scopes
LLM Customization
Code Completion
Enterprise Code Repositories
Fine-Tuning
Ulrich Finkler
IBM Research, Yorktown Heights, New York, USA
Irene Manotas
IBM Research, Yorktown Heights, New York, USA
Wei Zhang
IBM T.J. Watson Research Center
Computer Systems · Software Engineering · Concurrent Programming
Geert Janssen
IBM Research, Yorktown Heights, New York, USA
Octavian Popescu
IBM Research, Yorktown Heights, New York, USA
Shyam Ramji
IBM Research, Yorktown Heights, New York, USA