High-quality data augmentation for code comment classification

📅 2026-01-27

🤖 AI Summary

This work addresses code comment intent classification, where scarce manually labeled data and class imbalance limit model performance. The authors propose Q-SYNTH (Synthetic Quality Oversampling Technique and Augmentation Technique), which generates high-quality synthetic data to oversample and augment the NLBSE'26 challenge datasets. Q-SYNTH improves the base classifier by 2.56%, suggesting that synthetic data generation can help comment intent classification when labeled data is limited and skewed.

📝 Abstract

Code comments serve a crucial role in software development for documenting functionality, clarifying design choices, and assisting with issue tracking. They capture developers' insights about the surrounding source code, serving as an essential resource for both human comprehension and automated analysis. Nevertheless, since comments are in natural language, they present challenges for machine-based code understanding. To address this, recent studies have applied natural language processing (NLP) and deep learning techniques to classify comments according to developers' intentions. However, existing datasets for this task suffer from size limitations and class imbalance, as they rely on manual annotations and may not accurately represent the distribution of comments in real-world codebases. To overcome this issue, we introduce new synthetic oversampling and augmentation techniques based on high-quality data generation to enhance the NLBSE'26 challenge datasets. Our Synthetic Quality Oversampling Technique and Augmentation Technique (Q-SYNTH) yield promising results, improving the base classifier by $2.56\%$.
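To make the idea of oversampling for class imbalance concrete, here is a minimal sketch in Python. It is not the Q-SYNTH method, whose generation procedure the abstract does not describe. It only shows the generic pattern of topping up minority classes until every class matches the largest one. The function names, the `synthesize` callable, and the labels "summary" and "usage" are all hypothetical and chosen for illustration.

```python
import random
from collections import Counter


def oversampling_plan(labels):
    """Return how many extra samples each class needs to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}


def oversample(samples, labels, synthesize, seed=0):
    """Top up minority classes with synthetic variants of existing samples.

    `synthesize` is a hypothetical callable (text -> new text) standing in for
    whatever data generator is used; it is an assumption, not the paper's method.
    """
    rng = random.Random(seed)

    # Group the original comments by their class label.
    by_class = {}
    for text, label in zip(samples, labels):
        by_class.setdefault(label, []).append(text)

    out_samples, out_labels = list(samples), list(labels)
    for cls, deficit in oversampling_plan(labels).items():
        for _ in range(deficit):
            source = rng.choice(by_class[cls])  # pick a seed comment from this class
            out_samples.append(synthesize(source))
            out_labels.append(cls)
    return out_samples, out_labels


if __name__ == "__main__":
    comments = ["Returns the user id.", "Parses the header.", "Closes the stream.",
                "Call init() before use."]
    intents = ["summary", "summary", "summary", "usage"]  # hypothetical labels
    new_comments, new_intents = oversample(
        comments, intents, synthesize=lambda t: t + " (variant)"
    )
    print(Counter(new_intents))
```

In practice the placeholder `synthesize` would be replaced by a real generator, and the quality of its output would need to be checked before the new samples are added to the training set.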

Problem

Research questions and friction points this paper is trying to address.

code comment classification

data augmentation

class imbalance

dataset limitation

natural language processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

data augmentation

code comment classification

synthetic oversampling