🤖 AI Summary
Existing robustness evaluations of code generation models are overly focused on Python and lack comprehensive multi-language assessment. Method: We systematically benchmark mainstream large language models across Java, C++, and JavaScript, proposing a four-dimensional prompt perturbation framework—encompassing DocString, function name, syntax, and formatting—and conduct cross-lingual adversarial experiments. We further introduce MultiRobustCode, the first open-source, multi-language robustness benchmark for code generation. Contribution/Results: Our study reveals pronounced language-dependent robustness degradation (Java/C++ underperform Python), with DocString perturbations exerting the strongest impact. Model stability exhibits quantifiable, language-specific rankings. This work pioneers multi-language robustness evaluation for code generation, establishing a new paradigm and foundational infrastructure for assessing model generalization beyond Python-centric settings.
📝 Abstract
Large language models have gained significant traction in recent years, extending their usage to code-generation tasks. While this field has attracted considerable attention, testing and evaluating the robustness of code generation models remains underexplored. Previous studies have focused primarily on code generation models for Python, overlooking other widely used programming languages. In this research, we conduct a comprehensive comparative analysis of the robustness of several prominent code generation models and investigate how their performance varies across programming languages. To this end, we introduce perturbations in four key areas of the prompt: DocString, function name, syntax, and format. We have compiled and released a dedicated dataset for this purpose. This work presents our experimental findings, shedding light on the performance of code generation models under these perturbed scenarios.
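To make the four perturbation dimensions concrete, below is a minimal, hypothetical sketch of what such prompt perturbations might look like in practice. The function names and transformations are illustrative assumptions, not the paper's actual implementation; the authors' dataset may apply different or more sophisticated transformations.

```python
import re

# Hypothetical illustrations of three of the four perturbation dimensions
# (DocString, function name, format); the paper's actual transformations
# may differ.

def perturb_function_name(prompt: str, old: str, new: str) -> str:
    """Function-name perturbation: rename the target function in the prompt."""
    return re.sub(rf"\b{re.escape(old)}\b", new, prompt)

def perturb_docstring(prompt: str, replacement: str) -> str:
    """DocString perturbation: swap the docstring for a paraphrased version."""
    return re.sub(r'"""[\s\S]*?"""', f'"""{replacement}"""', prompt, count=1)

def perturb_format(prompt: str) -> str:
    """Format perturbation: change leading 4-space indentation to a tab."""
    return "\n".join(
        line.replace("    ", "\t", 1) for line in prompt.splitlines()
    )

prompt = 'def add_two(a, b):\n    """Return the sum of a and b."""\n'
renamed = perturb_function_name(prompt, "add_two", "sum_two")
```

A robust model should generate semantically equivalent code for the original and perturbed prompts; the benchmark measures how far generation quality degrades under each perturbation type.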