From Token to Line: Enhancing Code Generation with a Long-Term Perspective

📅 2025-04-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) for code generation often suffer from local overfitting and redundant outputs due to reliance on token-level autoregressive modeling. Method: This paper proposes a long-range, line-based generative paradigm—treating entire lines of code as fundamental modeling units—to overcome token-level limitations. We first observe that LLM attention mechanisms exhibit pronounced concentration at line endings; leveraging this insight, we design a line-level autoregressive framework. Furthermore, we introduce LSR-MCTS, a novel algorithm that jointly optimizes line-order decisions and generation trajectories, augmented with a node-level self-correction mechanism to enhance both output diversity and functional correctness. Contribution/Results: Our approach achieves state-of-the-art performance across three major benchmarks—HumanEval, MBPP, and CodeContests—improving average functional correctness by +8.2% while significantly enhancing structural coherence and syntactic soundness.

📝 Abstract
The emergence of large language models (LLMs) has significantly promoted the development of the code generation task, sparking a surge in pertinent literature. Current research is hindered by redundant generation results and a tendency to overfit local patterns in the short term. Although existing studies attempt to alleviate the issue by adopting a multi-token prediction strategy, there remains limited focus on choosing the appropriate processing length for generations. By analyzing the attention between tokens during the generation process of LLMs, it can be observed that high spikes in the attention scores typically appear at the ends of lines. This insight suggests that it is reasonable to treat each line of code as a fundamental processing unit and generate lines sequentially. Inspired by this, we propose the LSR-MCTS algorithm, which leverages MCTS to determine the code line by line and select the optimal path. Further, we integrate a self-refine mechanism at each node to enhance diversity and generate higher-quality programs through error correction. Extensive experiments and comprehensive analyses on three public coding benchmarks demonstrate that our method outperforms state-of-the-art approaches.
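The line-by-line MCTS loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `propose_lines` stands in for an LLM proposing candidate next lines, `rollout_score` stands in for a test-based or learned reward, and the fixed candidate pool is purely illustrative.

```python
import math
import random

def propose_lines(prefix_lines):
    """Hypothetical stand-in for an LLM proposing candidate next lines."""
    # In the paper's setting, an LLM generates candidate lines conditioned
    # on the code prefix; here we draw from a fixed toy pool.
    pool = ["x = 0", "for i in range(n):", "    x += i", "return x"]
    return pool[:2] if len(prefix_lines) < 4 else []

class Node:
    def __init__(self, lines, parent=None):
        self.lines = lines          # code prefix as a list of lines
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    # Unvisited children are explored first.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def rollout_score(lines):
    """Toy reward; in practice, pass rate on tests or a learned scorer."""
    return random.random()

def mcts(root, iters=50):
    for _ in range(iters):
        node = root
        # Selection: descend by UCB while children exist.
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: one child per candidate next line.
        for line in propose_lines(node.lines):
            node.children.append(Node(node.lines + [line], parent=node))
        leaf = node.children[0] if node.children else node
        reward = rollout_score(leaf.lines)
        # Backpropagation: update statistics along the path to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    # Return the most-visited first line's subtree root.
    return max(root.children, key=lambda n: n.visits) if root.children else root

best = mcts(Node([]))
print("\n".join(best.lines))
```

The key design point mirrored here is that each tree node holds a code *prefix in whole lines*, so every edge corresponds to one line-level decision rather than one token.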
Problem

Research questions and friction points this paper is trying to address.

Addresses redundant code generation in LLMs
Optimizes processing length for code lines
Enhances diversity and quality via self-refine
Innovation

Methods, ideas, or system contributions that make the work stand out.

Line-level code generation using LSR-MCTS
Self-refine mechanism for error correction
MCTS-based optimal path selection
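The node-level self-refine idea listed above can be sketched as a check-and-repair loop on each candidate line. This is an assumption-laden illustration: the paper re-prompts the LLM for corrections, whereas the `repair` helper here is a hypothetical stand-in that merely strips trailing junk, and `ast.parse` serves as a cheap syntax check.

```python
import ast

def refine_line(prefix, line, max_attempts=3):
    """Re-check a candidate line and repair it when the resulting
    prefix fails a syntax check (node-level self-correction sketch)."""
    def repair(bad_line, error):
        # Hypothetical stand-in for re-prompting the LLM with the error;
        # here we just strip trailing semicolons/spaces.
        return bad_line.rstrip("; ")

    candidate = line
    for _ in range(max_attempts):
        # Append "pass" so a prefix ending in a block header still parses.
        source = "\n".join(prefix + [candidate, "pass"])
        try:
            ast.parse(source)
            return candidate
        except SyntaxError as e:
            candidate = repair(candidate, e)
    return candidate

print(refine_line(["def f(n):"], "    total = 0 ;;"))
```

In the full algorithm, a node whose line survives this check is kept in the tree; a repaired line replaces the original candidate, which both fixes errors and adds diversity among siblings.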
Authors

Tingwei Lu (Tsinghua University)
Yangning Li (Tsinghua University, Peng Cheng Laboratory)
Liyuan Wang (Tsinghua University): bio-inspired learning, continual learning, neuroscience
Binghuai Lin (Tencent): machine learning, deep learning, LLM
Jiwei Tang (Tsinghua University): Natural Language Processing, Large Language Model
Wanshi Xu (Peking University)
Hai-Tao Zheng (Tsinghua University, Peng Cheng Laboratory)
Yinghui Li (Tsinghua University)
Bingxu An (Tencent Technology Co., Ltd)
Zhao Wei (Tencent Technology Co., Ltd)
Yong Xu (Tencent Technology Co., Ltd)