🤖 AI Summary
This work addresses the scarcity of high-quality automatic evaluation methods for large language models in non-English settings, a challenge exacerbated by the limited availability and high cost of human-annotated data in target languages. To overcome this, the paper proposes a cross-lingual evaluation framework based on evaluation decomposition. It introduces, for the first time, a language-agnostic Universal Criteria Set (UCS) that produces interpretable intermediate representations and thereby enables cross-lingual transfer without human annotations in the target language. By combining LLM-based automatic judgment with transfer learning, the framework substantially outperforms strong baselines on faithfulness evaluation tasks across multiple languages and model architectures, while also improving the interpretability and generalization of the evaluation system.
📄 Abstract
As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS): a shared, language-agnostic set of evaluation dimensions that produces an interpretable intermediate representation, supporting cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.
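To make the decomposition concrete, below is a minimal Python sketch of how a UCS-style judge could be wired up. The criteria names, prompt wording, `judge` callable, and simple averaging aggregation are all illustrative assumptions; the abstract does not specify the paper's actual dimensions, prompts, or aggregation scheme.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical criteria for illustration; the abstract does not
# enumerate the paper's actual UCS dimensions.
UNIVERSAL_CRITERIA = [
    "factual_consistency",  # every claim is supported by the source
    "completeness",         # no key source content is omitted
    "no_hallucination",     # nothing is invented beyond the source
]

@dataclass
class CriterionJudgment:
    criterion: str
    score: float    # normalized to [0, 1]
    rationale: str  # free-text explanation from the judge

def evaluate_faithfulness(
    source: str,
    output: str,
    judge: Callable[[str], tuple[float, str]],
) -> tuple[float, list[CriterionJudgment]]:
    """Decompose evaluation into one LLM judgment per criterion,
    then aggregate. The per-criterion judgments serve as the
    interpretable intermediate representation: the criteria are
    shared across languages, only the judged text changes."""
    judgments = []
    for criterion in UNIVERSAL_CRITERIA:
        prompt = (
            f"Criterion: {criterion}\n"
            f"Source:\n{source}\n\n"
            f"Output:\n{output}\n\n"
            "Rate how well the output satisfies the criterion "
            "on a 0-1 scale and briefly justify the rating."
        )
        score, rationale = judge(prompt)  # any LLM backend plugs in here
        judgments.append(CriterionJudgment(criterion, score, rationale))
    overall = sum(j.score for j in judgments) / len(judgments)
    return overall, judgments

# Stub judge so the sketch runs end to end; swap in a real LLM call.
overall, details = evaluate_faithfulness(
    "The meeting is on Tuesday.",
    "The meeting is on Tuesday.",
    judge=lambda prompt: (1.0, "output matches the source"),
)
print(f"overall={overall:.2f}")
for j in details:
    print(f"  {j.criterion}: {j.score:.2f} ({j.rationale})")
```

The design point this sketch is meant to surface is that the criteria list stays fixed across languages, so adapting the evaluator to a new language requires only a judge that can read that language, not new target-language annotations.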