🤖 AI Summary
This study investigates the effectiveness and applicability boundaries of natural-language-oriented text data augmentation techniques for source code classification. Method: We systematically evaluate eight NLP-based augmentation methods—including back-translation, synonym replacement, and random insertion/deletion—across four representative code classification tasks and four model architectures: CNN, RNN, Transformer, and CodeBERT. Contribution/Results: Our empirical analysis provides the first evidence that augmentation strategies not strictly preserving syntactic correctness—departing from conventional syntax-preserving paradigms—can significantly improve both classification accuracy and model robustness. Surprisingly, certain mild syntax-disrupting techniques enhance generalization capability. We identify multiple high-performing augmentation combinations, offering novel design principles and empirical guidance for data augmentation in code representation learning.
📝 Abstract
Recent studies have shown surprising results in source code learning, which applies deep neural networks (DNNs) to various software engineering tasks. Like other DNN-based domains, source code learning requires massive amounts of high-quality training data for these applications to succeed. In practice, data augmentation is a technique that produces additional training data to boost model training and has been widely adopted in other domains (e.g., computer vision). However, the existing practice of data augmentation in source code learning is limited to simple syntax-preserving methods, such as code refactoring. In this paper, based on the insight that source code can be represented sequentially as text data, we take an early step toward investigating whether data augmentation methods originally designed for text are effective for source code learning. To that end, we focus on code classification tasks and conduct a comprehensive empirical study on four critical code problems and four DNN architectures to assess the effectiveness of eight data augmentation methods. Our results identify the data augmentation methods that produce more accurate models for source code learning, and show that these methods remain useful even when they slightly break the syntax of the source code.
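To make the studied augmentation family concrete, the sketch below shows two EDA-style text augmentations (random deletion and random swap) applied to a token sequence of source code. This is an illustrative assumption of how such NLP techniques operate on code treated as text, not the paper's exact implementation; function names and parameters here are hypothetical. Note that both operations may break code syntax, which the study finds does not necessarily hurt, and can even help, model training.

```python
import random

def random_deletion(tokens, p=0.1, seed=0):
    """Drop each token independently with probability p (may break syntax).

    If every token is dropped, keep one random token so the sample is non-empty.
    """
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, n=1, seed=0):
    """Swap the tokens at two random positions, n times."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

# A code snippet represented sequentially as text tokens.
code = "def add ( a , b ) : return a + b".split()
print(" ".join(random_deletion(code, p=0.15, seed=42)))
print(" ".join(random_swap(code, n=2, seed=7)))
```

Other methods in the same family, such as synonym replacement and back-translation, follow the same pattern: transform the token sequence as if it were natural-language text, then train the classifier on both the original and augmented samples.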