SCALAR: A Part-of-speech Tagger for Identifiers

📅 2025-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses part-of-speech (POS) tagging for source code identifiers, a task on which general-purpose NLP taggers generalize poorly. The authors present a tagger specialized to developers' naming conventions: it maps each identifier to its sequence of POS tags (its grammar pattern). The approach relies on a manually validated ground-truth set linking identifiers to grammar patterns and on a feature engineering pipeline that integrates character-level features with domain-specific knowledge; a GradientBoostingClassifier is trained on these features. The key contribution is treating identifier names as natural-language phrases to be annotated with grammar patterns, which the domain-specific ground truth and tailored features make possible. The evaluation compares the tool's output with a previous version of the tagger and with a modern off-the-shelf POS tagger, and reports improved identifier annotation. The implementation is publicly available.

📝 Abstract

The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language used by developers to create all types of identifiers (e.g., function names, variable names, etc.). SCALAR's output is compared with a previous version of the tagger, as well as with a modern off-the-shelf part-of-speech tagger, to show how it improves upon other taggers' output for annotating identifiers. The code is available on GitHub.
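To make the task concrete, here is a minimal sketch of the kind of preprocessing such a tagger needs: splitting an identifier into words and building one feature row per word. This is not SCALAR's actual pipeline. The function names (`split_identifier`, `word_features`), the feature set, and the example tags are all assumptions chosen for illustration; rows like these could then be vectorized and passed to a classifier such as scikit-learn's GradientBoostingClassifier.

```python
import re


def split_identifier(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", name):
        # Match acronyms (HTTP in HTTPResponse), capitalized or lowercase
        # words, remaining uppercase runs, and digit runs, in that order.
        words.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
        )
    return [w.lower() for w in words]


def word_features(words: list[str], index: int, context: str) -> dict:
    """Build a hypothetical feature row for one word of an identifier.

    `context` is the kind of identifier (e.g. "function", "variable").
    The features are illustrative guesses, not the paper's feature set.
    """
    word = words[index]
    return {
        "word": word,
        "position": index,
        "is_last": index == len(words) - 1,
        "suffix": word[-2:],
        "is_digit": word.isdigit(),
        "context": context,
    }


# A made-up oracle entry: identifier, context, and a grammar pattern
# (one tag per word). The tag labels here are illustrative only.
identifier, context, pattern = "getUserName", "function", ["V", "NM", "N"]

words = split_identifier(identifier)
rows = [word_features(words, i, context) for i in range(len(words))]

for row, tag in zip(rows, pattern):
    print(row["word"], "->", tag)
```

Each word becomes one training example whose label is its tag in the oracle's grammar pattern, so the classifier predicts tags word by word and the predicted sequence forms the grammar pattern for the whole identifier.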

Problem

Research questions and friction points this paper is trying to address.

Develops a part-of-speech tagger for source code identifiers

Trains model to recognize natural language in developer-created identifiers

Compares performance with existing taggers for identifier annotation
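A comparison between taggers can be sketched as scoring each tagger's output against the hand-annotated oracle. The two metrics below (per-word accuracy and whole-pattern accuracy) are common choices and are assumed here for illustration; they are not necessarily the metrics the paper reports, and the sample data is made up.

```python
def token_accuracy(gold: list[list[str]], predicted: list[list[str]]) -> float:
    """Fraction of words whose predicted tag matches the oracle tag."""
    total = sum(len(g) for g in gold)
    # zip stops at the shorter sequence, so missing predictions count as wrong.
    correct = sum(
        g_tag == p_tag
        for g, p in zip(gold, predicted)
        for g_tag, p_tag in zip(g, p)
    )
    return correct / total if total else 0.0


def pattern_accuracy(gold: list[list[str]], predicted: list[list[str]]) -> float:
    """Fraction of identifiers whose entire grammar pattern is correct."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Made-up oracle patterns and made-up output from a hypothetical tagger.
gold = [["V", "NM", "N"], ["NM", "N"], ["N"]]
predicted = [["V", "N", "N"], ["NM", "N"], ["N"]]

print(f"token accuracy:   {token_accuracy(gold, predicted):.3f}")
print(f"pattern accuracy: {pattern_accuracy(gold, predicted):.3f}")
```

Whole-pattern accuracy is the stricter measure: a single wrong tag makes the entire identifier count as an error, which matters when downstream tools consume the full grammar pattern.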

Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains the tagger with scikit-learn's GradientBoostingClassifier on a manually curated oracle of identifier names and grammar patterns

Specializes in annotating source code identifiers

Improves accuracy over existing part-of-speech taggers
