Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages

📅 2026-02-02
📈 Citations: 0 · Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the confounding of genuine cross-lingual performance gaps with evaluation instability in multilingual model assessment. To isolate the intrinsic stability of evaluation methodologies, the authors propose a controlled generation paradigm that produces synthetic customer service dialogues in Estonian, Finnish, and Hungarian with identical parameters. Combining automatic metrics, LLM-as-a-judge evaluations, and native-speaker annotations, they find that surface-level measures—such as lexical diversity and semantic similarity—exhibit cross-lingual stability, whereas zero-shot judgments of pragmatic qualities like coherence and instruction following show substantial instability, including rank reversals and near-zero inter-language correlations. These findings indicate that current automatic evaluation methods require language-specific calibration for morphologically rich languages and underscore the value of the proposed paradigm as a diagnostic tool for cross-lingual evaluation robustness.
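To make the controlled-generation idea concrete, here is a minimal sketch of holding every sampling parameter fixed while varying only the target language. The provider client, model name, and prompt wording are stand-in assumptions for illustration; the authors' released protocol (linked in the abstract) may differ.

```python
# Minimal sketch of controlled generation: identical sampling parameters,
# identical prompt template, only the target language varies. The OpenAI
# client, model name, and prompt are assumptions, not the authors' setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LANGUAGES = ["Estonian", "Finnish", "Hungarian"]
GEN_PARAMS = dict(temperature=0.7, top_p=0.9, max_tokens=512, seed=42)

PROMPT = (
    "Write a customer-support dialogue in {language} between an agent "
    "and a customer resolving a delayed-delivery complaint."
)

def build_controlled_corpus() -> dict[str, str]:
    """One dialogue per language under identical generation conditions,
    so downstream score variance is attributable to the evaluator."""
    corpus = {}
    for lang in LANGUAGES:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical choice of generator
            messages=[{"role": "user", "content": PROMPT.format(language=lang)}],
            **GEN_PARAMS,
        )
        corpus[lang] = response.choices[0].message.content
    return corpus
```

Because every parameter other than the language is pinned, any disagreement among downstream judge scores cannot be explained by differing generation conditions.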

📝 Abstract
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
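The rank-reversal diagnostic described in the abstract can be reproduced in a few lines: score the same set of models in each language, then compute rank correlations between language pairs. The scores below are illustrative placeholders, not figures from the paper.

```python
# Sketch of the stability diagnostic: compare model rankings induced by
# judge scores across language pairs. All scores here are made up.
from itertools import combinations
from scipy.stats import spearmanr

judge_scores = {  # model -> per-language coherence score (illustrative)
    "model_a": {"et": 4.1, "fi": 3.2, "hu": 3.9},
    "model_b": {"et": 3.6, "fi": 3.8, "hu": 3.1},
    "model_c": {"et": 3.9, "fi": 3.5, "hu": 4.0},
}

models = sorted(judge_scores)
for lang_a, lang_b in combinations(["et", "fi", "hu"], 2):
    scores_a = [judge_scores[m][lang_a] for m in models]
    scores_b = [judge_scores[m][lang_b] for m in models]
    rho, _ = spearmanr(scores_a, scores_b)
    # Near-zero or negative rho means the judge ranks models differently
    # across languages despite identical generation conditions.
    print(f"{lang_a} vs {lang_b}: Spearman rho = {rho:+.2f}")
```

Under the paper's controlled design, rho near 1 across all language pairs would indicate a judge that transfers cleanly; the abstract reports near-zero correlations and rank inversions for coherence and instruction-following.
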
Problem

Research questions and friction points this paper is trying to address.

cross-lingual evaluation
LLM-as-a-judge
evaluation stability
Finno-Ugric languages
controlled generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-lingual stability
LLM-as-a-judge
controlled generation
Finno-Ugric languages
evaluation reliability