🤖 AI Summary
Existing evaluations of LLMs’ programming capabilities over-rely on static accuracy metrics and neglect robustness; adversarial attack-based methods suffer from limited efficacy and poor cross-model comparability.
Method: We propose EVALOOP, the first robustness evaluation framework based on a self-consistent closed-loop paradigm. Leveraging the natural duality between code generation and code summarization, EVALOOP constructs an attack-free feedback loop integrating program synthesis, multi-round self-feedback reasoning, and pass@1 decay analysis.
Contribution/Results: EVALOOP enables automated robustness assessment across 16 mainstream LLMs. Experiments show that ten feedback loops induce an absolute pass@1 degradation of 5.01%–19.31%, revealing substantial misalignment between initial performance and robustness (e.g., GPT-3.5-Turbo exhibits lower robustness than DeepSeek-V2 despite stronger initial code generation). EVALOOP establishes the first cross-model, unified, and reproducible robustness benchmark for programming tasks.
📝 Abstract
Assessing the programming capabilities of Large Language Models (LLMs) is crucial for their effective use in software engineering. Current evaluations, however, predominantly measure the accuracy of generated code on static benchmarks and neglect the critical aspect of model robustness during programming tasks. While adversarial attacks offer insights into model robustness, their effectiveness is limited and the resulting evaluation can be constrained. Current adversarial attack methods for robustness evaluation yield inconsistent results and struggle to provide a unified evaluation across different LLMs. We introduce EVALOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, i.e., leveraging the natural duality inherent in popular software engineering tasks, e.g., code generation and code summarization. EVALOOP initiates a self-contained feedback loop: an LLM generates output (e.g., code) from an input (e.g., a natural language specification), and then uses that generated output as the input to produce a new output (e.g., summarizing the code into a new specification). EVALOOP repeats this process, assessing the LLM's performance at each loop. This cyclical strategy intrinsically evaluates robustness without relying on any external attack setup, providing a unified metric for evaluating LLMs' robustness in programming. We evaluate 16 prominent LLMs (e.g., GPT-4.1, O4-mini) on EVALOOP and find that it typically induces a 5.01%–19.31% absolute drop in pass@1 performance within ten loops. Intriguingly, robustness does not always align with initial performance (i.e., a one-time query); for instance, GPT-3.5-Turbo, despite superior initial code generation compared to DeepSeek-V2, demonstrated lower robustness over repeated evaluation loops.
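The feedback loop described above can be sketched as a simple alternation between code generation and code summarization, recording a pass@1 signal at each round. This is only an illustrative sketch: `query_llm` and `run_tests` below are hypothetical stand-ins for a real LLM API call and a unit-test harness, and are not part of EVALOOP itself.

```python
# Minimal sketch of a self-consistent evaluation loop in the style of
# EVALOOP. `query_llm` and `run_tests` are hypothetical placeholders,
# NOT real EVALOOP components.

def query_llm(prompt: str, task: str) -> str:
    # Placeholder: a real implementation would call an LLM with a
    # task-specific prompt. Here we echo the input to stay runnable.
    return f"[{task}] {prompt}"

def run_tests(code: str) -> bool:
    # Placeholder pass@1 check: a real harness would execute the
    # generated code against the benchmark's unit tests.
    return True

def evaloop(spec: str, n_loops: int = 10) -> list[bool]:
    """Alternate spec -> code -> spec, recording pass@1 each round."""
    pass_at_1 = []
    for _ in range(n_loops):
        code = query_llm(spec, task="generate_code")   # spec -> code
        pass_at_1.append(run_tests(code))              # functional check
        spec = query_llm(code, task="summarize_code")  # code -> new spec
    return pass_at_1

results = evaloop("Return the sum of a list of integers.", n_loops=3)
```

Plotting the per-loop pass@1 (or its decay relative to the first round) then yields the robustness curve the paper analyzes; no adversarial perturbation of the input is needed, since any drift is produced by the model's own generate/summarize cycle.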