Context-Free Grammar Inference for Complex Programming Languages in Black Box Settings

📅 2026-01-18
🤖 AI Summary
Existing approaches fail to infer expressive context-free grammars for complex programming languages such as C, C++, and Java within a practical time budget. This work proposes Crucio, a black-box grammar inference method designed to scale to such languages. Crucio builds a decomposition forest to extract short examples from the inputs and uses a distributional matrix to carry out both lexical and grammar inference on those short examples. In the reported experiments, Crucio is the only evaluated tool that infers grammars for complex languages, whose nonterminal counts are up to 23× those of prior benchmarks. On the simpler prior benchmarks, it improves average recall by 1.37× and F1 by 1.21× over Treevada, and by 1.19× and 1.13× over Kedavra.
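
The recall and F1 figures above can be read with the sampling-based evaluation commonly used in black-box grammar inference. The following is a minimal sketch under stated assumptions, not the paper's evaluation harness: recall is taken as the share of strings sampled from the reference language that the inferred grammar accepts, precision as the share of strings sampled from the inferred grammar that the black-box oracle accepts, and F1 as their harmonic mean. The toy oracle (balanced parentheses), the toy inferred language (purely nested parentheses), and all sample strings are hypothetical.

```python
def oracle_accepts(s: str) -> bool:
    """Hypothetical black-box oracle: non-empty balanced parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        else:
            return False
    return depth == 0 and len(s) > 0


def inferred_accepts(s: str) -> bool:
    """Hypothetical inferred grammar: only purely nested strings such as (( ))."""
    n = len(s) // 2
    return n > 0 and s == "(" * n + ")" * n


def precision_recall_f1(golden_samples, inferred_samples):
    # Recall: how much of the reference language the inferred grammar covers.
    recall = sum(inferred_accepts(s) for s in golden_samples) / len(golden_samples)
    # Precision: how much of what the inferred grammar generates is valid.
    precision = sum(oracle_accepts(s) for s in inferred_samples) / len(inferred_samples)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical samples drawn from the reference language and the inferred grammar.
golden_samples = ["()", "(())", "()()", "(()())"]
inferred_samples = ["()", "(())", "((()))"]

precision, recall, f1 = precision_recall_f1(golden_samples, inferred_samples)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under this reading, a figure such as "1.37× recall" is a ratio between two tools' recall values on the same benchmark, not an absolute score.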

📝 Abstract
Grammar inference for complex programming languages remains a significant challenge, as existing approaches fail to scale to real-world datasets within practical time constraints. In our experiments, none of the state-of-the-art tools, including Arvada, Treevada, and Kedavra, were able to infer grammars for complex languages such as C, C++, and Java within 48 hours. Arvada and Treevada perform grammar inference directly on full-length input examples, which is inefficient for the large files common in such languages. Kedavra introduces data decomposition to create shorter examples for grammar inference, but its lexical analysis still relies on the original inputs, and its strict no-overgeneralization constraint limits the construction of complex grammars. To overcome these limitations, we propose Crucio, which builds a decomposition forest to extract short examples for lexical and grammar inference via a distributional matrix. Experimental results show that Crucio is the only method capable of inferring grammars for complex programming languages (where the number of nonterminals is up to 23x greater than in prior benchmarks) within reasonable time limits. On the simpler benchmarks used in prior work, Crucio achieves an average recall improvement of 1.37x over Treevada and 1.19x over Kedavra, and improves F1 scores by 1.21x and 1.13x, respectively.
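
To give a feel for what a distributional matrix can look like in a black-box setting, here is a minimal sketch of the general distributional-learning idea. It is not Crucio's algorithm, whose details the abstract does not give. Rows are candidate substrings, columns are (left, right) contexts, and a cell is 1 when the oracle accepts left + substring + right. Substrings with identical rows behave alike in every tested context and are candidates for sharing a nonterminal. The toy oracle, the substrings, and the contexts are all hypothetical.

```python
def oracle(s: str) -> bool:
    """Hypothetical black-box parser for: expr := term ('+' term)*,
    term := '1' | '2' | '(' expr ')'."""
    pos = 0

    def term() -> bool:
        nonlocal pos
        if pos < len(s) and s[pos] in "12":
            pos += 1
            return True
        if pos < len(s) and s[pos] == "(":
            pos += 1
            if not expr():
                return False
            if pos < len(s) and s[pos] == ")":
                pos += 1
                return True
        return False

    def expr() -> bool:
        nonlocal pos
        if not term():
            return False
        while pos < len(s) and s[pos] == "+":
            pos += 1
            if not term():
                return False
        return True

    return expr() and pos == len(s)


def distribution_matrix(substrings, contexts, accepts):
    # One row per substring; one 0/1 entry per (left, right) context.
    return {
        sub: tuple(int(accepts(left + sub + right)) for left, right in contexts)
        for sub in substrings
    }


def group_by_row(matrix):
    # Substrings with the same row are interchangeable in all tested contexts.
    groups = {}
    for sub, row in matrix.items():
        groups.setdefault(row, []).append(sub)
    return groups


substrings = ["1", "2", "(1)", "1+2", "+", "("]
contexts = [("", ""), ("(", ")"), ("1+", ""), ("", "+2"), ("1", "2")]

matrix = distribution_matrix(substrings, contexts, oracle)
groups = group_by_row(matrix)
for row, members in groups.items():
    print(row, members)
```

In this toy run the expression-like substrings fall into one group, the operator into another, and the lone parenthesis into a third. Each cell costs one oracle query, so keeping the examples short keeps both the matrix and the queried strings small, which is consistent with the motivation the abstract gives for decomposition.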
Problem

Research questions and friction points this paper is trying to address.

grammar inference
context-free grammar
complex programming languages
black-box setting
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

grammar inference
decomposition forest
distributional matrix
context-free grammar
black-box learning
Authors
Feifei Li
Tsinghua Shenzhen International Graduate School, China; Peng Cheng Laboratory, China
Xiao Chen
The University of Newcastle
Software Security, Mobile Security, AI for Security
Xiaoyu Sun
Australian National University
Software Engineering, Program Analysis, Intelligent Software Engineering, Software Security
Xi Xiao
Oak Ridge National Laboratory | University of Alabama at Birmingham
LLM / MLLM Efficiency, Image / Video Generation, Image / Video Understanding
Shaohua Wang
Central University of Finance and Economics, China
Yong Ding
School of Computer Science, Guangdong University of Science and Technology, Dongguan 523668, China
Sheng Wen
School of Computer Science, Guangdong University of Science and Technology, Dongguan 523668, China
Qing Li
Pengcheng Laboratory
Indoor localization, depth prediction, camera relocalization, neural reconstruction