From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Instruction-tuned large language models for code often lack boundary awareness in Fill-in-the-Middle (FIM) tasks, necessitating post-hoc truncation to discard extraneous output—yet cross-lingual truncation strategies are inconsistent and suboptimal. Method: The authors conduct the first systematic investigation into whether supervised fine-tuning (SFT) can inherently improve contextual and boundary alignment, eliminating reliance on heuristic post-processing. Using the Qwen2.5-Coder family, they introduce a binary evaluation paradigm—“complete-line vs. random-fragment”—on HumanEval Infilling and SAFIM benchmarks. Contribution/Results: SFT significantly enhances boundary-aware generation: fine-tuned models achieve optimal performance without post-processing in complete-line scenarios, while truncation remains necessary only for random fragments. This reveals the conditional necessity of post-processing in FIM, challenging the assumption of universal truncation requirements and providing empirical grounding for boundary-aware code generation.
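The FIM setup the summary describes can be sketched as a prefix-suffix-middle (PSM) prompt: the model sees the code before and after a gap and must generate only the missing middle. The sentinel tokens below follow the convention used by several code LLMs, including the Qwen2.5-Coder family; treat the exact token strings and the helper name as illustrative assumptions, not the paper's implementation.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix-suffix-middle (PSM) fill-in-the-middle prompt.

    Sentinel tokens follow the convention used by several code LLMs
    (e.g. Qwen2.5-Coder); they are illustrative here. The model is
    expected to emit only the missing middle after <|fim_middle|>.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Example: ask the model to fill in the body of a small function.
prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))")
```

A boundary-aware model stops exactly where the suffix begins; the paper's observation is that raw instruction-tuned models often overshoot this boundary, which is what makes post-hoc truncation necessary.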

📝 Abstract
Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation due to the frequent presence of extraneous code in raw outputs. This extraneous generation suggests a lack of awareness regarding output boundaries, requiring truncation for effective evaluation. Determining an optimal truncation strategy, however, often proves intricate, particularly when the scope includes several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that seamlessly integrates with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on the HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without post-processing, especially when the middle consists of complete lines. However, post-processing of the LLM outputs remains necessary when the middle is a random span of code.
Problem

Research questions and friction points this paper is trying to address.

Evaluating raw LLM outputs for fill-in-the-middle code generation accuracy
Determining optimal truncation strategies for multi-language code outputs
Assessing post-processing necessity in instruction-tuned LLM code generation
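One simple instance of the truncation strategies discussed above is to cut the raw generation at the first line that echoes the start of the ground-truth suffix, discarding everything the model produced past its boundary. The paper does not specify its exact heuristics, so the function name and anchor-line rule here are assumptions for illustration only.

```python
def truncate_middle(generated: str, suffix: str) -> str:
    """Heuristically truncate extraneous output from a raw FIM generation.

    Cuts the generation at the first line that repeats the first
    non-empty line of the ground-truth suffix -- one simple example
    of post-hoc truncation; real strategies vary by language and
    benchmark, which is exactly the friction the paper highlights.
    """
    suffix_lines = [ln for ln in suffix.splitlines() if ln.strip()]
    if not suffix_lines:
        return generated  # nothing to anchor on; keep output as-is
    anchor = suffix_lines[0]
    kept = []
    for line in generated.splitlines():
        if line == anchor:
            break  # model started regenerating the suffix; stop here
        kept.append(line)
    return "\n".join(kept)

# The model filled the gap but then echoed part of the suffix:
raw = "    return a + b\nprint(add(1, 2))"
print(truncate_middle(raw, "print(add(1, 2))"))  # -> "    return a + b"
```

Line-based anchoring like this works reasonably when the middle is made of complete lines, but breaks down for random intra-line spans, mirroring the paper's finding that truncation remains necessary only in the random-fragment setting.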
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-processing for FIM code evaluation
Supervised fine-tuning improves integration
Optimal truncation strategy investigation