SCALAR: A Part-of-speech Tagger for Identifiers

📅 2025-04-23
🤖 AI Summary
This work addresses the problem of part-of-speech (POS) tagging for source code identifiers—a task where general-purpose NLP tools suffer from poor generalization in code contexts. To overcome this, we propose the first syntax-pattern mapping method customized to developers’ naming conventions. Our approach constructs a manually validated ground-truth repository linking identifiers to syntactic patterns and designs a feature engineering pipeline integrating character-level features with domain-specific knowledge. We employ a GradientBoostingClassifier for end-to-end mapping. The key contribution lies in formulating identifier naming as a natural-language syntax-pattern sequence generation task—enabled by domain-specific ground truth and tailored features—which substantially improves generalization. Experimental results demonstrate statistically significant gains in accuracy over conventional POS taggers and state-of-the-art code analysis tools. The implementation is publicly available.

📝 Abstract
The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language that developers use to create all types of identifiers (e.g., function names and variable names). SCALAR's output is compared with that of a previous version of the tagger and of a modern off-the-shelf part-of-speech tagger to show how it improves on other taggers' output for annotating identifiers. The code is available on GitHub.
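
To make the idea of a grammar pattern concrete, here is a minimal Python sketch. It is not SCALAR's implementation. It assumes that identifiers are split on underscores and camelCase boundaries, and the tags shown (V for verb, NM for noun modifier, N for noun) are assigned by hand for illustration rather than produced by the tagger. The helper names `split_identifier` and `annotate` are hypothetical.

```python
import re


def split_identifier(name: str) -> list[str]:
    """Split an identifier into lowercase words.

    Assumed convention: words are separated by underscores or by
    camelCase boundaries; acronym runs such as "HTTP" stay together.
    """
    words = []
    for chunk in re.split(r"[_\W]+", name):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [w.lower() for w in words]


def annotate(words: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Pair each word with its part-of-speech tag."""
    if len(words) != len(tags):
        raise ValueError("need exactly one tag per word")
    return list(zip(words, tags))


# Hand-labelled illustration, not tagger output.
words = split_identifier("getUserName")
pairs = annotate(words, ["V", "NM", "N"])
grammar_pattern = " ".join(tag for _, tag in pairs)
print(pairs, "->", grammar_pattern)
```

The grammar pattern is simply the tag sequence read off the annotated words, so two differently named identifiers can share the same pattern.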
Problem

Research questions and friction points this paper is trying to address.

Develops a part-of-speech tagger for source code identifiers
Trains model to recognize natural language in developer-created identifiers
Compares performance with existing taggers for identifier annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses GradientBoostingClassifier for training
Specializes in annotating source code identifiers
Improves accuracy over existing part-of-speech taggers
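
As a rough illustration of training a per-word tagger with scikit-learn's GradientBoostingClassifier, the sketch below fits a model on a tiny hand-written stand-in for the curated oracle. Everything beyond the classifier itself is an assumption made for the example: the toy data, the tag labels, the feature set in `word_features`, the hyperparameters, and the helper names `train_tagger` and `tag_identifier`. None of these is taken from SCALAR's actual pipeline.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy oracle: (words of an identifier, one tag per word).
ORACLE = [
    (["get", "user", "name"], ["V", "NM", "N"]),
    (["set", "file", "path"], ["V", "NM", "N"]),
    (["max", "buffer", "size"], ["NM", "NM", "N"]),
    (["index"], ["N"]),
    (["convert", "to", "string"], ["V", "P", "N"]),
    (["read", "from", "socket"], ["V", "P", "N"]),
    (["error", "count"], ["NM", "N"]),
    (["parse", "config"], ["V", "N"]),
]


def word_features(words, i):
    """Illustrative features for the i-th word of an identifier."""
    word = words[i]
    return {
        "word": word,                  # string features are one-hot encoded
        "suffix2": word[-2:],
        "length": len(word),
        "position": i,
        "is_first": int(i == 0),
        "is_last": int(i == len(words) - 1),
        "identifier_length": len(words),
    }


def train_tagger(oracle):
    """Fit a gradient-boosted classifier that predicts one tag per word."""
    X, y = [], []
    for words, tags in oracle:
        for i, tag in enumerate(tags):
            X.append(word_features(words, i))
            y.append(tag)
    model = make_pipeline(
        DictVectorizer(sparse=False),
        GradientBoostingClassifier(n_estimators=20, random_state=0),
    )
    model.fit(X, y)
    return model


def tag_identifier(model, words):
    """Return the predicted tag sequence (grammar pattern) for the words."""
    rows = [word_features(words, i) for i in range(len(words))]
    return [str(tag) for tag in model.predict(rows)]


model = train_tagger(ORACLE)
print(tag_identifier(model, ["get", "file", "size"]))
```

With so few training examples the predictions carry no weight. The sketch only shows the shape of the approach: each word becomes a feature row, and the classifier assigns one tag per row.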
Authors

Christian D. Newman
Rochester Institute of Technology
Software Engineering, Software Maintenance, Software Linguistics, Program Comprehension, NLP

Brandon Scholten
Department of Computer Science, Kent State University, Kent, OH, USA

Sophia Testa
Department of Computer Science, Kent State University, Kent, OH, USA

Joshua A. C. Behler
Department of Computer Science, Kent State University, Kent, OH, USA

Syreen Banabilah
Department of Computer Science, Kent State University, Kent, OH, USA

Michael L. Collard
Associate Professor of Computer Science, The University of Akron
Software Engineering, Software Evolution, Software Maintenance, Computer Science

M. J. Decker
Department of Software Engineering, Bowling Green State University, Bowling Green, OH, USA

Mohamed Wiem Mkaouer
University of Michigan-Flint
Software Engineering, Software Quality, SBSE, Refactoring, Smells

Marcos Zampieri
George Mason University
Computational Linguistics, Natural Language Processing

Eman Abdullah AlOmar
Stevens Institute of Technology
Software Engineering, Software Quality, Refactoring, Artificial Intelligence, Large Language Models

Reem S. Alsuhaibani
Prince Sultan University Department of Software Engineering, Riyadh, Saudi Arabia

Anthony S Peruma
University of Hawaii at Manoa Department of Information and Computer Sciences, Honolulu, HI, USA

Jonathan I. Maletic
Professor of Computer Science, Kent State University
Software Engineering, Software Evolution, Software Maintenance, Computer Science