Bringing Structure to Naturalness: On the Naturalness of ASTs

📅 2024-04-14

🏛️ 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion)

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper investigates whether structured representations of source code, specifically abstract syntax trees (ASTs), are statistically predictable, and formalizes this as the “Structured Naturalness Hypothesis.” Method: The authors explicitly formulate the hypothesis and test it empirically using TreeLSTMs, n-gram language models, and AST serialization techniques across several programming languages (Ruby, Java, Python). Contribution/Results: They find that AST naturalness depends on the language: TreeLSTM models over ASTs are competitive with n-gram models for Ruby but perform worse for Java and Python. In addition, naturalness signals can be used for just-in-time defect prediction without manual feature engineering, with near state-of-the-art results. The work provides empirical evidence of statistical regularities in code structure and makes explicit an assumption that existing work on tree- and graph-based models of code leaves implicit.

📝 Abstract

Source code comes in different shapes and forms. Previous research has already shown code to be more predictable than natural language at the token level: source code can be natural. More recently, the structure of code - either as graphs or trees - has been successfully used to improve the state-of-the-art on numerous tasks: code suggestion, code summarisation, method naming etc. This body of work implicitly assumes that structured representations of code are similarly statistically predictable, i.e. natural. We consider that this view should be made explicit and propose directly studying the Structured Naturalness Hypothesis. Beyond just naming existing research that assumes this hypothesis and formulating it, we also provide evidence for tree representations: TreeLSTM models over ASTs for some languages, such as Ruby, are competitive with n-gram models while handling the syntax token issue highlighted by previous research ‘for free’. For other languages, such as Java or Python, we find tree models to perform worse, suggesting that downstream task improvement is uncorrelated to the language modelling task. Further, we show how one may use naturalness signals for near state-of-the-art results on just-in-time defect prediction without manual feature engineering work.
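To make the notion of "naturalness" over trees concrete, here is a minimal sketch that serializes a Python AST into a token sequence and scores it with an add-one-smoothed bigram model, reporting cross-entropy in bits per token (lower means more predictable). This is an illustration only, not the paper's setup: the bracketed pre-order serialization, the `BigramModel` class, the smoothing scheme, and the toy corpus are all assumptions introduced here, and the paper's TreeLSTM models are not reproduced.

```python
import ast
import math
from collections import Counter
from typing import Iterable, Iterator, List


def serialize(source: str) -> List[str]:
    """Pre-order traversal of the AST: emit each node's type name,
    wrapping its children in brackets so the tree shape is kept."""
    def walk(node: ast.AST) -> Iterator[str]:
        yield type(node).__name__
        children = list(ast.iter_child_nodes(node))
        if children:
            yield "("
            for child in children:
                yield from walk(child)
            yield ")"
    return list(walk(ast.parse(source)))


class BigramModel:
    """Hypothetical add-alpha smoothed bigram model over AST tokens."""

    def __init__(self, sequences: Iterable[List[str]], alpha: float = 1.0):
        self.alpha = alpha
        self.context_counts = Counter()  # how often a token appears as context
        self.bigram_counts = Counter()   # how often (context, token) appears
        self.vocab = {"<s>"}
        for seq in sequences:
            prev = "<s>"
            for tok in seq:
                self.bigram_counts[(prev, tok)] += 1
                self.context_counts[prev] += 1
                self.vocab.add(tok)
                prev = tok

    def cross_entropy(self, seq: List[str]) -> float:
        """Average negative log2-probability per token (bits per token)."""
        v = len(self.vocab) + 1  # one extra slot for unseen tokens
        prev, total = "<s>", 0.0
        for tok in seq:
            p = (self.bigram_counts[(prev, tok)] + self.alpha) / (
                self.context_counts[prev] + self.alpha * v
            )
            total -= math.log2(p)
            prev = tok
        return total / len(seq)


# Toy training corpus (illustrative only): simple assignments.
corpus = ["x = 1", "y = 2", "name = 3", "total = x + y"]
model = BigramModel(serialize(src) for src in corpus)

familiar = model.cross_entropy(serialize("a = 3"))
unusual = model.cross_entropy(serialize("for i in range(3):\n    print(i)"))
print(f"familiar: {familiar:.2f} bits/token, unusual: {unusual:.2f} bits/token")
```

A structure the model has seen often (a plain assignment) scores lower than one built from node types absent from the training corpus, which is the sense in which a representation is more or less "natural" with respect to a model.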

Problem

Research questions and friction points this paper is trying to address.

Investigates statistical predictability of structured code representations

Tests TreeLSTM models on ASTs across programming languages

Applies naturalness signals for defect prediction without manual features

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses TreeLSTM models over ASTs to measure how predictable code structure is

Formulates the Structured Naturalness Hypothesis: structured representations of code are statistically predictable

Naturalness signals enable defect prediction without feature engineering
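As a rough illustration of how a naturalness signal could feed defect prediction without hand-built features, the sketch below fits a one-feature threshold classifier on per-change entropy scores. It assumes each change already has a naturalness score (for example, cross-entropy in bits per token under some language model over its AST). The scores, labels, and the `fit_threshold` and `predict_defective` helpers are hypothetical, and the paper's actual prediction pipeline is not described in this summary.

```python
from typing import List


def fit_threshold(scores: List[float], labels: List[bool]) -> float:
    """Pick the entropy cut-off that classifies the most training changes
    correctly, predicting 'defective' when score >= cut-off."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(set(scores)):
        correct = sum((s >= candidate) == y for s, y in zip(scores, labels))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def predict_defective(score: float, threshold: float) -> bool:
    """Flag a change as likely defective if it is at least as 'unnatural'
    (high-entropy) as the learned cut-off."""
    return score >= threshold


# Hypothetical history: (entropy in bits/token, was the change defective?)
history_scores = [1.9, 2.1, 2.4, 2.8, 3.6, 4.2]
history_labels = [False, False, False, True, True, True]

threshold = fit_threshold(history_scores, history_labels)
for new_score in (2.0, 3.0):
    print(new_score, predict_defective(new_score, threshold))
```

The only input is the entropy score itself, which is what "without manual feature engineering" amounts to in this toy setting; a realistic system would use a stronger classifier and held-out evaluation.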
