Metamorphic Testing of Deep Code Models: A Systematic Literature Review

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep code models exhibit inconsistent outputs under semantics-preserving transformations (e.g., variable renaming), revealing critical robustness deficiencies in real-world software engineering contexts. Method: This work presents the first systematic literature review on metamorphic testing for code intelligence models, analyzing 45 studies to synthesize program transformation strategies, coverage metrics, adversarial sample generation techniques, and robustness evaluation frameworks. Contribution/Results: The review characterizes distributions and key challenges across model architectures, downstream tasks, programming languages (predominantly Python and Java), and evaluation metrics. It identifies widely adopted benchmarks (e.g., CodeXGLUE, HumanEval) and standard evaluation protocols. Furthermore, it proposes three future research directions: multi-granularity semantics-preserving transformations, dynamic coverage-guided mutation, and task-aware robustness assessment. Collectively, this study provides both theoretical foundations and practical guidelines for enhancing the reliability and deployment readiness of code models in industrial settings.

📝 Abstract
Large language models and deep learning models designed for code intelligence have revolutionized the software engineering field due to their ability to perform various code-related tasks. These models can process source code and software artifacts with high accuracy in tasks such as code completion, defect detection, and code summarization; therefore, they can potentially become an integral part of modern software engineering practices. Despite these capabilities, robustness remains a critical quality attribute for deep code models, as they may produce different results under varied and adversarial conditions (e.g., variable renaming). Metamorphic testing has become a widely used approach to evaluate model robustness by applying semantics-preserving transformations to input programs and analyzing the stability of model outputs. While prior research has explored testing deep learning models, this systematic literature review focuses specifically on metamorphic testing for deep code models. By studying 45 primary papers, we analyze the transformations, techniques, and evaluation methods used to assess robustness. Our review summarizes the current landscape, identifying frequently evaluated models, programming tasks, datasets, target languages, and evaluation metrics, and highlights key challenges and future directions for advancing the field.
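To make the core idea concrete: a minimal sketch of metamorphic testing for a code model, using the abstract's own example of variable renaming as the semantics-preserving transformation. The `toy_model` here is a hypothetical stand-in for a real code model; the metamorphic relation simply asserts that its output should be identical on the original and transformed program.

```python
import ast

def rename_variable(source: str, old: str, new: str) -> str:
    """Semantics-preserving transformation: rename every occurrence
    of the local variable `old` to `new` in the given source."""
    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node: ast.Name) -> ast.Name:
            if node.id == old:
                node.id = new
            return node
    tree = Renamer().visit(ast.parse(source))
    return ast.unparse(tree)

def metamorphic_check(model, source: str, old: str, new: str) -> bool:
    """Metamorphic relation: the model's output should be stable
    under a semantics-preserving variable renaming."""
    return model(source) == model(rename_variable(source, old, new))

# Toy stand-in "model": predicts whether the snippet defines a function.
# A real evaluation would query a deep code model instead.
def toy_model(src: str) -> bool:
    return any(isinstance(n, ast.FunctionDef) for n in ast.walk(ast.parse(src)))

original = "def add(a, b):\n    total = a + b\n    return total"
print(metamorphic_check(toy_model, original, "total", "result"))  # True
```

A robustness evaluation in the surveyed style would run many such transformation/input pairs and report the fraction of cases where the relation holds, rather than a single boolean.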
Problem

Research questions and friction points this paper is trying to address.

Assessing robustness of deep code models via metamorphic testing
Analyzing semantic-preserving transformations in code models
Identifying challenges in testing deep learning for code tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Metamorphic testing for deep code models
Semantic-preserving input transformations
Robustness evaluation via output stability