🤖 AI Summary
Existing robustness evaluations of code generation models are overly focused on Python and lack comprehensive multi-language assessment. Method: We systematically benchmark mainstream large language models across Java, C++, and JavaScript, proposing a four-dimensional prompt perturbation framework—encompassing DocString, function name, syntax, and formatting—and conduct cross-lingual adversarial experiments. We further introduce MultiRobustCode, the first open-source, multi-language robustness benchmark for code generation. Contribution/Results: Our study reveals pronounced language-dependent robustness degradation (Java/C++ underperform Python), with DocString perturbations exerting the strongest impact. Model stability exhibits quantifiable, language-specific rankings. This work pioneers multi-language robustness evaluation for code generation, establishing a new paradigm and foundational infrastructure for assessing model generalization beyond Python-centric settings.
📝 Abstract
Large language models have gained significant traction in recent years, extending their usage to code-generation tasks. While this field has attracted considerable attention, testing and evaluating the robustness of code generation models remains underexplored. Previous studies have focused primarily on code generation models for Python, overlooking other widely used programming languages. In this research, we conduct a comprehensive comparative analysis of the robustness of several prominent code generation models and investigate how their performance varies across programming languages. To this end, we introduce perturbations in four key areas of the prompt: DocString, function name, syntax, and format. We have compiled and released a dedicated dataset for this purpose. This work presents our experimental findings, shedding light on the performance of code generation models under these perturbed scenarios.
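To make the four perturbation dimensions concrete, below is a minimal, hypothetical sketch of what such prompt perturbations might look like in practice. The function names and transformations are illustrative assumptions, not the paper's actual implementation; the authors' dataset may apply different or more sophisticated transformations.

```python
import re

# Hypothetical illustrations of three of the four perturbation dimensions
# (DocString, function name, format); the paper's actual transformations
# may differ.

def perturb_function_name(prompt: str, old: str, new: str) -> str:
    """Function-name perturbation: rename the target function in the prompt."""
    return re.sub(rf"\b{re.escape(old)}\b", new, prompt)

def perturb_docstring(prompt: str, replacement: str) -> str:
    """DocString perturbation: swap the docstring for a paraphrased version."""
    return re.sub(r'"""[\s\S]*?"""', f'"""{replacement}"""', prompt, count=1)

def perturb_format(prompt: str) -> str:
    """Format perturbation: change leading 4-space indentation to a tab."""
    return "\n".join(
        line.replace("    ", "\t", 1) for line in prompt.splitlines()
    )

prompt = 'def add_two(a, b):\n    """Return the sum of a and b."""\n'
renamed = perturb_function_name(prompt, "add_two", "sum_two")
```

A robust model should generate semantically equivalent code for the original and perturbed prompts; the benchmark measures how far generation quality degrades under each perturbation type.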