Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study challenges the widespread assumption in existing agentic code-evaluation benchmarks that English is the default judge language, an assumption that overlooks how language choice affects model performance rankings. It is the first to systematically investigate the interaction between judge language and judge backbone, fully localizing the Agent-as-a-Judge prompting stack into five typologically diverse languages and conducting requirement-level evaluations across multiple developer-agent frameworks and judge models. Results show that no single model dominates across all languages: GPT-4o excels in English, while Gemini leads in Arabic and Hindi. Partial localization substantially reduces satisfaction scores (e.g., Hindi drops from 42.8% to 23.2%), and inter-model judgment consistency remains low (κ ≤ 0.231), underscoring the necessity of fully localized judge instructions.
📝 Abstract
Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge's language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4,950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%, p < 0.001 vs. GPT-4o) and Hindi (53.22%). No single backbone dominates across all languages, and inter-backbone agreement on individual requirement judgments is modest (Fleiss' κ ≤ 0.231). A controlled ablation further shows that localizing judge-side instructions, not just benchmark content, can be decisive: Hindi satisfaction drops from 42.8% to 23.2% under partial localization. These results indicate that language should be treated as an explicit evaluation variable in agentic benchmarks. Full requirement-level judgments and runtime statistics are released for reproducibility.
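
To make the two headline statistics concrete, below is a minimal Python sketch (not the authors' released code) of how per-requirement binary verdicts from multiple judge backbones could be aggregated into a satisfaction rate and an inter-backbone Fleiss' κ. The judgment matrix is illustrative dummy data, not values from the paper.

```python
# Sketch: aggregate requirement-level binary judgments from several judge
# backbones into (a) a satisfaction rate and (b) Fleiss' kappa agreement.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) count matrix, where
    counts[i, j] = number of raters placing item i in category j."""
    n_items, _ = counts.shape
    r = counts.sum(axis=1)[0]                     # raters per item (assumed constant)
    p_j = counts.sum(axis=0) / (n_items * r)      # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - r) / (r * (r - 1))  # per-item agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical judgments: rows = requirements, columns = 6 judge backbones,
# 1 = "requirement satisfied", 0 = "not satisfied".
judgments = np.array([
    [1, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
])

satisfaction_rate = judgments.mean()  # fraction of "satisfied" verdicts overall
# Convert per-backbone verdicts into per-requirement category counts (yes, no).
counts = np.stack([judgments.sum(axis=1),
                   judgments.shape[1] - judgments.sum(axis=1)], axis=1)
print(f"satisfaction rate: {satisfaction_rate:.1%}")
print(f"Fleiss' kappa:     {fleiss_kappa(counts):.3f}")
```

A low κ here, as in the paper's reported κ ≤ 0.231, would mean the backbones frequently disagree on which individual requirements are satisfied even when their aggregate satisfaction rates look similar.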
Problem

Research questions and friction points this paper is trying to address.

Agent-as-a-Judge
multilingual evaluation
language sensitivity
backbone ranking
requirement-level evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Prompt Localization
Agent-as-a-Judge
Language Sensitivity
Backbone Evaluation
Requirement-Level Judgment