aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing

📅 2024-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between accuracy and inference efficiency in large language model (LLM)-based code completion, this paper introduces aiXcoder-7B, a lightweight 7B-parameter code-specialized LLM. The paper proposes a Structured Fill-In-the-Middle (SFIM) training objective that accounts for code syntax structures, combined with diverse data sampling strategies that capture cross-file context and multi-objective training on 1.2 trillion unique, high-quality code tokens collected through a rigorous data pipeline. Evaluated on five established code completion benchmarks and one new benchmark collected by the paper, aiXcoder-7B consistently outperforms six similarly sized baselines and even surpasses larger models, including StarCoder2-15B and CodeLlama-34B, demonstrating a "small model, large performance" result. As of January 2025, the open-sourced model has garnered 2,226 GitHub stars.

📝 Abstract
Large Language Models (LLMs) have been widely used in code completion, and researchers are focusing on scaling up LLMs to improve their accuracy. However, larger LLMs have lower inference efficiency, affecting developers' experience and productivity. In this paper, we propose a lightweight and effective LLM for code completion named aiXcoder-7B. Compared to existing LLMs, aiXcoder-7B achieves higher code completion accuracy while having smaller scales (i.e., 7 billion parameters). We attribute the superiority of aiXcoder-7B to three key factors: (1) Multi-objective training. We employ three training objectives, one of which is our proposed Structured Fill-In-the-Middle (SFIM). SFIM considers the syntax structures in code and effectively improves the performance of LLMs for code. (2) Diverse data sampling strategies. They consider inter-file relationships and enhance the capability of LLMs in understanding cross-file contexts. (3) Extensive high-quality data. We establish a rigorous data collection pipeline and consume a total of 1.2 trillion unique tokens for training aiXcoder-7B. This vast volume of data enables aiXcoder-7B to learn a broad distribution of code. We evaluate aiXcoder-7B in five popular code completion benchmarks and a new benchmark collected by this paper. The results show that aiXcoder-7B outperforms the latest six LLMs with similar sizes and even surpasses four larger LLMs (e.g., StarCoder2-15B and CodeLlama-34B), positioning aiXcoder-7B as a lightweight and effective LLM for academia and industry. Finally, we summarize three valuable insights for helping practitioners train the next generations of LLMs for code. aiXcoder-7B has been open-sourced and gained significant attention. As of January 2025, aiXcoder-7B has received 2,226 GitHub stars.
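The abstract describes SFIM as a fill-in-the-middle objective that respects code syntax rather than masking an arbitrary character span. As a rough illustration of that idea (not the paper's actual implementation — the special-token names and the statement-level masking choice here are assumptions), the sketch below uses Python's `ast` module to pick a syntactically complete statement as the masked "middle":

```python
# Hypothetical sketch of a syntax-aware fill-in-the-middle sample builder.
# Plain FIM masks a random character span; an SFIM-style variant instead
# masks a span aligned to a syntax unit (here: a complete statement found
# via Python's ast module). <PREFIX>/<MIDDLE>/<SUFFIX> are illustrative
# placeholders, not the paper's actual special tokens.
import ast
import random

def sfim_sample(source: str, seed: int = 0) -> dict:
    """Split source into prefix/middle/suffix, where the middle is a
    syntactically complete statement rather than a random span."""
    rng = random.Random(seed)
    tree = ast.parse(source)
    # Collect statements whose line spans are known (Python 3.8+).
    stmts = [n for n in ast.walk(tree)
             if isinstance(n, ast.stmt) and getattr(n, "end_lineno", None)]
    node = rng.choice(stmts)
    lines = source.splitlines(keepends=True)
    start, end = node.lineno - 1, node.end_lineno  # 1-based -> slice bounds
    return {
        "prefix": "".join(lines[:start]),
        "middle": "".join(lines[start:end]),
        "suffix": "".join(lines[end:]),
    }

code = "x = 1\nif x:\n    y = x + 1\nprint(y)\n"
parts = sfim_sample(code, seed=1)
# A FIM-trained model would see something like:
#   <PREFIX>prefix<SUFFIX>suffix<MIDDLE>  -> predict middle
assert parts["prefix"] + parts["middle"] + parts["suffix"] == code
```

The design point being illustrated: because the masked span is a complete syntax node, the model learns to generate well-formed code units instead of fragments that start or end mid-expression.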
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Code Completion
Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Fill-In-the-Middle (SFIM)
Diverse Data Selection Strategy
High-Quality Large-Scale Code Corpus
Siyuan Jiang
Eastern Michigan University
software engineering, program comprehension, automatic program generation, deep learning, natural language processing
Jia Li
Peking University, Beijing, China
He Zong
aiXcoder, Beijing, China
Huanyu Liu
Peking University, Beijing, China
Hao Zhu
Peking University, Beijing, China
Shukai Hu
aiXcoder, Beijing, China
Erlu Li
aiXcoder, Beijing, China
Jiazheng Ding
aiXcoder, Beijing, China
Yu Han
aiXcoder, Beijing, China
Wei Ning
aiXcoder, Beijing, China
Gen Wang
aiXcoder, Beijing, China
Yihong Dong
Peking University
Code Generation, Large Language Models
Kechi Zhang
Peking University
AI4SE
Ge Li
Full Professor of Computer Science, Peking University
Program Analysis, Program Generation, Deep Learning