Structure-Aware Fill-in-the-Middle Pretraining for Code

📅 2025-05-30

📈 Citations: 0

✨ Influential: 0

career value

170K/year

🤖 AI Summary

Existing code large language models employ character-level random masking in Fill-in-the-Middle (FIM) pretraining, ignoring syntactic structure and thus causing semantic fragmentation and misalignment with realistic code editing patterns. This work proposes an Abstract Syntax Tree (AST)-driven, structure-aware FIM pretraining paradigm: leveraging ASTs to identify complete syntactic units (e.g., functions, expressions) and enforce syntax-consistent context completion. Key contributions include: (1) the first AST-guided FIM pretraining method; (2) Real-FIM-Eval—the first multilingual benchmark derived from real GitHub commits for evaluating practical code editing; and (3) substantial improvements in real-world code editing capability on 1B- and 8B-parameter models, with up to +5.0 points on standard FIM benchmarks. The implementation is open-sourced and supports 12 programming languages.

Technology Category

Application Category

📝 Abstract

Fill-in-the-Middle (FIM) is a common pretraining method for code LLMs, where models complete code segments given surrounding context. However, existing LLMs treat code as plain text and mask random character spans. We propose and evaluate AST-FIM, a pretraining strategy that leverages Abstract Syntax Trees (ASTs) to mask complete syntactic structures at scale, ensuring coherent training examples better aligned with universal code structures and common code editing patterns such as blocks, expressions, or functions. To evaluate real-world fill-in-the-middle (FIM) programming tasks, we introduce Real-FIM-Eval, a benchmark derived from 30,000+ GitHub commits across 12 languages. On infilling tasks, experiments on 1B and 8B parameter models show that AST-FIM is particularly beneficial for real-world code editing as it outperforms standard random-character FIM by up to 5 pts on standard FIM benchmarks. Our code is publicly available at https://github.com/gonglinyuan/ast_fim.

Problem

Research questions and friction points this paper is trying to address.

Improving code LLMs by using ASTs for structured FIM pretraining

Addressing incoherent training in random-character FIM methods

Enhancing real-world code editing performance via AST-FIM

Innovation

Methods, ideas, or system contributions that make the work stand out.

AST-FIM uses Abstract Syntax Trees for masking

Ensures coherent training with code structures

Outperforms standard FIM by 5 points

🔎 Similar Papers

Horizon-Length Prediction: Advancing Fill-in-the-Middle Capabilities for Code Generation with Lookahead Planning

2024-10-04arXiv.orgCitations: 5

Microsoft

$119,800 -

San Francisco Bay area / New York City metropolitan area

Authors to Follow