Inferring Input Grammars from Code with Symbolic Parsing

📅 2025-03-11
🤖 AI Summary
Problem: Existing grammar inference techniques fail to accurately reverse-engineer context-free grammars from real-world recursive-descent parsers that lack formal syntactic specifications. Method: We propose the first symbolic grammar mining approach tailored to industrial-grade recursive-descent parsers. Our method statically models parser semantics via program analysis and integrates nonterminal mapping, path truncation, and bounded recursion unfolding to mitigate path explosion and infinite recursion, which enables fully automated, seedless inference of precise context-free grammars. Contribution/Results: Evaluated on complex parsers including TINY-C and JSON, our technique achieves 99–100% grammar extraction accuracy, substantially outperforming prior work. This is the first end-to-end grammar reverse-engineering solution for production recursive-descent parsers, enabling grammar-based full-coverage test generation, protocol reverse engineering, and automatic documentation.

📝 Abstract
Generating effective test inputs for a software system requires that these inputs be valid, as they will otherwise be rejected without reaching actual functionality. In the absence of a specification for the input language, common test generation techniques rely on sample inputs, which are abstracted into matching grammars and/or evolved guided by test coverage. However, if sample inputs miss features of the input language, the chances of generating these features randomly are slim. In this work, we present the first technique for symbolically and automatically mining input grammars from the code of recursive descent parsers. So far, the complexity of parsers has made such a symbolic analysis challenging to impossible. Our realization of the symbolic parsing technique overcomes these challenges by (1) associating each parser function parse_ELEM() with a nonterminal; (2) limiting recursive calls and loop iterations, such that a symbolic analysis of parse_ELEM() needs to consider only a finite number of paths; and (3) creating, for each path, an expansion alternative for that function's nonterminal. Being purely static, symbolic parsing does not require seed inputs; as it mitigates path explosion, it scales to complex parsers. Our evaluation indicates that symbolic parsing is highly accurate. Applied to parsers for complex languages such as TINY-C or JSON, our STALAGMITE implementation extracts grammars with an accuracy of 99–100%, a substantial improvement over the state of the art, even though it requires only the program code and no input samples. The resulting grammars cover the entire input space, allowing for comprehensive and effective test generation, reverse engineering, and documentation.
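
The three steps in the abstract can be illustrated with a small Python sketch. This is not STALAGMITE's implementation, which analyzes real parser code symbolically; here a hand-written model of two hypothetical parser functions, parse_list() and parse_elem(), stands in for the code. The node kinds ("lit", "call", "seq", "alt", "loop"), the LOOP_BOUND constant, and the functions paths() and mine() are all assumptions made for this example.

```python
from itertools import product

LOOP_BOUND = 2  # hypothetical cap on loop iterations considered per path

# Hand-written stand-in for the control flow of two parser functions.
# parse_list(): expect "[", call parse_elem(), loop { expect ",", call parse_elem() }, expect "]"
# parse_elem(): either expect "1" or call parse_list()
PARSERS = {
    "list": ("seq", [
        ("lit", "["),
        ("call", "elem"),
        ("loop", ("seq", [("lit", ","), ("call", "elem")])),
        ("lit", "]"),
    ]),
    "elem": ("alt", [("lit", "1"), ("call", "list")]),
}


def paths(node, bound=LOOP_BOUND):
    """Return one tuple of grammar symbols per bounded path through `node`."""
    kind = node[0]
    if kind == "lit":
        # A consumed literal becomes a quoted terminal.
        return [(repr(node[1]),)]
    if kind == "call":
        # Step (1): a call to parse_X() becomes the nonterminal <X>.
        # The callee is not inlined, so recursion cannot unfold forever.
        return [(f"<{node[1]}>",)]
    if kind == "seq":
        result = [()]
        for child in node[1]:
            result = [p + q for p in result for q in paths(child, bound)]
        return result
    if kind == "alt":
        return [p for child in node[1] for p in paths(child, bound)]
    if kind == "loop":
        # Step (2): consider only 0..bound iterations of the loop body.
        body = paths(node[1], bound)
        result = []
        for n in range(bound + 1):
            for combo in product(body, repeat=n):
                result.append(tuple(s for part in combo for s in part))
        return result
    raise ValueError(f"unknown node kind: {kind}")


def mine(parsers, bound=LOOP_BOUND):
    """Step (3): each distinct path becomes one expansion alternative."""
    grammar = {}
    for name, body in parsers.items():
        alternatives = []
        for path in paths(body, bound):
            if path not in alternatives:
                alternatives.append(path)
        grammar[f"<{name}>"] = alternatives
    return grammar


if __name__ == "__main__":
    for nonterminal, alternatives in mine(PARSERS, bound=1).items():
        for alt in alternatives:
            print(nonterminal, "::=", " ".join(alt))
```

With bound=1, the sketch yields two alternatives for the list nonterminal (zero or one loop iteration) and two for the elem nonterminal. Bounded unrolling alone only approximates repetition; how loops are generalized in the actual technique is not described in this summary.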
Problem

Research questions and friction points this paper is trying to address.

Automatically infer input grammars from recursive descent parsers.
Overcome challenges in symbolic analysis of complex parsers.
Enable comprehensive test generation without requiring input samples.
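
To make the test-generation goal concrete, the following Python sketch produces valid inputs from a grammar alone, with no sample inputs. The grammar is a hand-written toy for a nested list language, not output of the paper's tool, and the generate() function and its max_depth parameter are assumptions made for this example.

```python
import random

# Hypothetical toy grammar: keys are nonterminals, values are lists of
# alternatives, and every symbol that is not a key is a terminal.
GRAMMAR = {
    "<list>": [("[", "<elem>", "]"), ("[", "<elem>", ",", "<elem>", "]")],
    "<elem>": [("1",), ("<list>",)],
}


def generate(symbol, grammar, rng, depth=0, max_depth=4):
    """Expand `symbol` into a string by picking random alternatives."""
    if symbol not in grammar:
        return symbol  # terminal: emit as-is
    alternatives = grammar[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the alternative with the fewest
        # nonterminals so that the expansion terminates.
        alternatives = [
            min(alternatives, key=lambda alt: sum(s in grammar for s in alt))
        ]
    chosen = rng.choice(alternatives)
    return "".join(
        generate(s, grammar, rng, depth + 1, max_depth) for s in chosen
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducible output
    for _ in range(5):
        print(generate("<list>", GRAMMAR, rng))
```

Every generated string is valid by construction, since it is derived from the grammar. A grammar that misses a language feature can never produce it, which is why the accuracy of the mined grammar matters for test generation.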
Innovation

Methods, ideas, or system contributions that make the work stand out.

Symbolic parsing mines grammars from parser code.
Limits recursion and loops for finite path analysis.
Generates accurate grammars without input samples.