Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the persistent value of syntax-aware representations in billion-parameter large language models (LLMs) for code generation. While syntactic errors have markedly declined in ultra-large LLMs, the utility of explicit syntactic information remains questionable. To address this, we propose GrammarCoder—a family of models that explicitly integrate programming language grammar via context-free grammar (CFG)-guided decoding, syntax-constrained token prediction, and a Transformer-adapted syntax embedding enhancement module. Our study provides the first empirical evidence that syntactic information continues to significantly improve semantic discrimination—not merely syntactic correctness—in ultra-large LLMs, effectively mitigating semantic errors induced by minor code perturbations. On HumanEval+ and MBPP+, GrammarCoder achieves substantial accuracy gains; syntax error rates approach zero, and semantic error rates decrease by 12.7%.

📝 Abstract
Grammar serves as a cornerstone of programming languages and software engineering, providing the framework that defines the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code representations in small-scale models, showing their ability to reduce syntax errors and enhance performance. However, as language models scale to the billion-parameter level or beyond, syntax-level errors become rare, making it unclear whether grammar information still provides performance benefits. To explore this, we develop a series of billion-scale GrammarCoder models that incorporate grammar rules into the code generation process. Experiments on HumanEval(+) and MBPP(+) demonstrate a notable improvement in code generation accuracy. Further analysis shows that grammar-based representations enhance LLMs' ability to discern subtle code differences, reducing semantic errors caused by minor variations. These findings suggest that grammar-based code representations remain valuable even in billion-scale models, not only by maintaining syntactic correctness but also by improving semantic differentiation.
Problem

Research questions and friction points this paper is trying to address.

Explores the benefits of grammar-based code representation for large language models.
Assesses whether grammar rules improve code generation accuracy in billion-scale models.
Investigates grammar's role in reducing semantic errors in code generation.
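A concrete illustration (ours, not an example from the paper) of the kind of semantic error the work targets: two Python functions that differ by a single token yet compute different results, exactly the "minor variation" a token-level model can conflate.

```python
# Two functions differing by one token (`n` vs `n + 1`) but with
# different semantics -- a hypothetical example of the minor code
# perturbations discussed in the abstract, not taken from the paper.

def sum_below(n):
    """Sum of integers 0 .. n-1."""
    return sum(range(n))

def sum_through(n):
    """Sum of integers 0 .. n."""
    return sum(range(n + 1))

print(sum_below(5), sum_through(5))  # 10 15
```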
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed billion-scale GrammarCoder models
Incorporated grammar rules in code generation
Improved semantic differentiation in LLMs
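The core idea, representing code as a sequence of grammar rules rather than raw tokens, can be sketched with Python's standard `ast` module. The production format below (`Parent -> Children`) is our own simplification for illustration, not GrammarCoder's actual representation.

```python
import ast

def rule_sequence(code: str) -> list[str]:
    """Linearize a program as a preorder sequence of grammar productions.

    Toy sketch of grammar-based representation: each step emits one
    production "Parent -> Children" instead of a raw token, so every
    prefix of the sequence corresponds to a valid partial syntax tree.
    The format is illustrative only.
    """
    seq = []

    def visit(node):
        children = list(ast.iter_child_nodes(node))
        rhs = " ".join(type(c).__name__ for c in children) or "<terminal>"
        seq.append(f"{type(node).__name__} -> {rhs}")
        for child in children:
            visit(child)

    visit(ast.parse(code))
    return seq

print(rule_sequence("x = a + b")[:3])
```

Decoding under such a representation can be constrained so that only productions legal at the current tree position are predicted, which is what keeps syntax error rates near zero by construction.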
Qingyuan Liang
Peking University
Software Engineering, Code Generation
Zhao Zhang
School of Computer Science, Peking University; Kuaishou Technology
Zeyu Sun
Institute of Software, Chinese Academy of Sciences
Zheng Lin
Kuaishou Technology
Qi Luo
Department of Computer Science and Engineering, Southern University of Science and Technology
Yueyi Xiao
School of Computer Science, Peking University
Yizhou Chen
Peking University
AI4SE, Vulnerability Detection, Formal Verification
Yuqun Zhang
Department of Computer Science and Engineering, Southern University of Science and Technology
Haotian Zhang
Kuaishou Technology
Lu Zhang
School of Computer Science, Peking University
Bin Chen
Kuaishou Technology
Yingfei Xiong
Associate Professor, Peking University
Software Engineering, Programming Languages, Program Repair, Program Synthesis, Program Analysis