Inferring Pluggable Types with Machine Learning

πŸ“… 2024-06-21
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 2
✨ Influential: 1
πŸ€– AI Summary
To address the high cost of manually annotating legacy Java codebases for pluggable type systems (such as the nullness checking enforced by the NullAway typechecker), this paper proposes an automated, machine-learning-based method for inferring type qualifiers. The approach introduces NaP-AST, a lightweight program representation that encodes minimal dataflow hints as structural cues. The paper presents a systematic empirical comparison of Graph Transformer Networks (GTNs), Graph Convolutional Networks (GCNs), and large language models (LLMs) for this task, finding that the GTN performs best. Evaluated on 12 open-source Java projects from a prior NullAway study, the GTN-based method attains a recall of 0.89 and a precision of 0.60, reducing warnings in all but one unannotated project. A feasibility study on training-set size finds that performance improves at around 16K Java classes and deteriorates due to overfitting at around 22K. This work points toward scalable, learning-based inference of type qualifiers for legacy Java codebases.

πŸ“ Abstract
Pluggable type systems allow programmers to extend the type system of a programming language to enforce semantic properties defined by the programmer. Pluggable type systems are difficult to deploy in legacy codebases because they require programmers to write type annotations manually. This paper investigates how to use machine learning to infer type qualifiers automatically. We propose a novel representation, NaP-AST, that encodes minimal dataflow hints for the effective inference of type qualifiers. We evaluate several model architectures for inferring type qualifiers, including Graph Transformer Network, Graph Convolutional Network, and Large Language Model. We further validate these models by applying them to 12 open-source programs from a prior evaluation of the NullAway pluggable typechecker, lowering warnings in all but one unannotated project. We find that GTN shows the best performance, with a recall of 0.89 and a precision of 0.60. Furthermore, we conduct a study to estimate the number of Java classes needed for good performance of the trained model. In our feasibility study, performance improved around 16k classes and deteriorated due to overfitting around 22k classes.
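To make the annotation burden concrete, the sketch below shows the kind of nullness qualifier the paper aims to infer automatically. It uses a locally defined `@Nullable` as a stand-in for the real qualifier (in practice, NullAway recognizes annotations such as `org.jspecify.annotations.Nullable`); the `lookup` method and its guard are illustrative assumptions, not code from the paper.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;

public class NullnessDemo {
    // Stand-in qualifier for illustration only; real projects use a
    // published annotation (e.g. from JSpecify) that NullAway recognizes.
    @Target({ElementType.METHOD, ElementType.PARAMETER, ElementType.FIELD})
    @interface Nullable {}

    // The @Nullable qualifier tells the typechecker this method may
    // return null, so callers must check before dereferencing.
    static @Nullable String lookup(String key) {
        return key.equals("host") ? "localhost" : null;
    }

    public static void main(String[] args) {
        String value = lookup("port");
        // Without the null check, NullAway would report a warning:
        // dereferencing a @Nullable value.
        System.out.println(value == null ? "missing" : value);
    }
}
```

Writing such qualifiers by hand across a large legacy codebase is the manual cost the paper's inference approach is meant to eliminate.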
Problem

Research questions and friction points this paper is trying to address.

Automatically inferring pluggable type qualifiers using machine learning
Proposing NaP-AST representation for effective type qualifier inference
Evaluating model performance on real-world Java codebases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses machine learning for type qualifier inference
Introduces NaP-AST representation with dataflow hints
Identifies the Graph Transformer Network as the best-performing architecture
Kazi Amanul Islam Siddiqui
Department of Computer Science, New Jersey Institute of Technology, Newark, USA
Martin Kellogg
Assistant Professor, NJIT
Software Engineering Β· Verification