🤖 AI Summary
This work addresses part-of-speech (POS) tagging for source code identifiers, a task on which general-purpose NLP taggers generalize poorly because identifier names do not follow ordinary sentence structure. The authors present SCALAR, a tagger that maps each identifier name to its grammar pattern, that is, its sequence of POS tags. The model is a scikit-learn GradientBoostingClassifier trained on a manually curated oracle of identifier names paired with their grammar patterns, which specializes it to the naming conventions developers use across all identifier types, such as function and variable names. The output is compared against a previous version of the tagger and a modern off-the-shelf POS tagger to show where it improves identifier annotation. The implementation is publicly available.
📝 Abstract
The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language used by developers to create all types of identifiers (e.g., function names, variable names, etc.). SCALAR's output is compared with a previous version of the tagger, as well as a modern off-the-shelf part-of-speech tagger, to show how it improves upon other taggers' output for annotating identifiers. The code is available on GitHub.
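To make the setup in the abstract concrete, the following is a minimal sketch of training a scikit-learn GradientBoostingClassifier to tag the words of an identifier. It is not SCALAR's implementation: the tiny oracle, the tag set (V for verb, NM for noun modifier, N for noun), the feature names, and the helpers `split_identifier`, `word_features`, and `tag_identifier` are all assumptions made for illustration.

```python
import re

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy oracle: identifier name -> grammar pattern (one tag per word).
# The tag set here is illustrative, not the one used by the tool.
ORACLE = [
    ("getUserName", ["V", "NM", "N"]),
    ("max_buffer_size", ["NM", "NM", "N"]),
    ("parseConfigFile", ["V", "NM", "N"]),
    ("errorCount", ["NM", "N"]),
    ("open_socket", ["V", "N"]),
    ("userList", ["NM", "N"]),
    ("readLine", ["V", "N"]),
    ("totalBytes", ["NM", "N"]),
]


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
    return [part.lower() for part in spaced.split()]


def word_features(words, i):
    """Assumed per-word features: the word, its suffix, length, and position."""
    word = words[i]
    return {
        "word": word,
        "last_two": word[-2:],
        "length": len(word),
        "position": i,
        "is_last": int(i == len(words) - 1),
    }


# Build one training row per word in the oracle.
rows, labels = [], []
for name, tags in ORACLE:
    words = split_identifier(name)
    for i, tag in enumerate(tags):
        rows.append(word_features(words, i))
        labels.append(tag)

# One-hot encode string features; numeric features pass through unchanged.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(rows)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X, labels)


def tag_identifier(name):
    """Return (word, predicted tag) pairs for an identifier name."""
    words = split_identifier(name)
    feats = [word_features(words, i) for i in range(len(words))]
    predicted = model.predict(vectorizer.transform(feats))
    return [(word, str(tag)) for word, tag in zip(words, predicted)]


if __name__ == "__main__":
    print(tag_identifier("setFileName"))
```

With only eight training names the predictions carry no weight; the sketch shows only the shape of the pipeline (split the name, featurize each word, classify each word), which in practice would rely on a much larger manually curated oracle.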