Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work presents the first systematic study of semantic equivalence bias in large language models (LLMs) acting as reference-free code evaluators: whether LLMs assign fair scores to functionally equivalent programs that differ only superficially (e.g., in variable names, comments, or formatting). The authors define six categories of evaluation bias and run zero-shot and few-shot experiments across five programming languages, applying multilingual code perturbations to multiple state-of-the-art LLMs, with statistical significance testing and quantitative bias analysis. Results reveal pervasive, systematic positive and negative biases across languages and models that persist even when judges are prompted to generate test cases, producing substantial score distortion and undermining the reliability of LLM-based code assessment. The study provides empirical evidence and a framework for developing more robust and fair automated code evaluation methods.

📝 Abstract
With the growing use of large language models (LLMs) as evaluators, their application has expanded to code evaluation tasks, where they assess the correctness of generated code without relying on reference implementations. While this offers scalability and flexibility, it also raises a critical, unresolved question: Can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations? Functionally correct code often exhibits variations, such as differences in variable names, comments, or formatting, that should not influence its correctness. Yet, whether LLM judges can reliably handle these variations remains unclear. We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation and revealing their systematic impact on LLM judges. Across five programming languages and multiple LLMs, we empirically demonstrate that all tested LLM judges are susceptible to both positive and negative biases, resulting in inflated or unfairly low scores. Moreover, we observe that LLM judges remain vulnerable to these biases even when prompted to generate test cases before scoring, highlighting the need for more robust code evaluation methods.
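A minimal sketch of the semantics-preserving variation at issue (illustrative code, not taken from the paper): two Python functions that differ only in identifier names, comments, and formatting, yet compute the same result. A fair judge should score them identically.

```python
def mean(xs):
    """Return the arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)


# Functionally equivalent variant: only names, comments, and
# formatting differ, so its correctness score should not change.
def average_of_values(numbers):
    total = sum(numbers)      # accumulate the values
    count = len(numbers)      # how many values there are
    return total / count
```

Both definitions agree on every input, which is exactly the property the paper's perturbations preserve while the judges' scores shift.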
Problem

Research questions and friction points this paper is trying to address.

Assessing biases in LLM judges for code evaluation
Exploring fairness in evaluating semantically equivalent code variations
Investigating robustness of LLMs in scoring superficial code differences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defining six types of code evaluation biases
Testing LLM judges across five programming languages
Analyzing biases in scoring without reference implementations
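The bias analysis above could be quantified, in its simplest form, as the mean score gap between original and perturbed programs. The scores below are hypothetical placeholders, not results from the paper; real values would come from an LLM judge.

```python
# Hypothetical judge scores (0-10) for five programs, before and
# after a semantics-preserving perturbation (e.g., renaming variables).
original_scores  = [9, 8, 9, 7, 8]
perturbed_scores = [7, 8, 6, 7, 6]

def mean(xs):
    return sum(xs) / len(xs)

# Negative gap => the judge penalizes the perturbation (negative bias);
# positive gap would indicate inflated scores (positive bias).
score_gap = mean(perturbed_scores) - mean(original_scores)
```

Since functional behavior is unchanged, any nonzero gap (tested for significance, as the study does) is evidence of judge bias rather than a real quality difference.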