🤖 AI Summary
Existing dialogue evaluation benchmarks are static, quickly outdated, and thin in multilingual coverage, which limits their ability to capture cross-linguistic and cross-cultural nuances and compromises the reliability of LLM assessment. To address this, the authors propose MEDAL, the first meta-evaluation framework for constructing multilingual dialogue evaluation benchmarks that explicitly target dialogue assessment capability. MEDAL is a closed-loop pipeline combining multi-agent dialogue generation by several collaborating LLMs, multidimensional automated analysis with GPT-4.1 as a judge, and fine-grained human annotation, supported by cross-lingual prompt engineering. The resulting benchmark markedly improves linguistic diversity and cultural representativeness. Empirical analysis reveals systematic weaknesses of mainstream LLMs in identifying empathy- and reasoning-related dialogue flaws, and shows that MEDAL effectively discriminates between reasoning-capable and non-reasoning LLMs in evaluation performance.
📝 Abstract
As the capabilities of chatbots and their underlying LLMs continue to improve dramatically, evaluating their performance has increasingly become a major blocker to further development. A key challenge lies in the available benchmarking datasets, which are largely static, outdated, and lacking in multilingual coverage, limiting their ability to capture subtle linguistic and cultural variations. This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate multilingual user-chatbot dialogues, conditioned on varied seed contexts. A strong LLM (GPT-4.1) is then used for a multidimensional analysis of chatbot performance, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new multilingual meta-evaluation benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. We find that current LLMs struggle to detect nuanced issues, particularly those involving empathy and reasoning.
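To make the described pipeline concrete, below is a minimal Python sketch of the generate-then-judge loop: two LLMs role-play a user and a chatbot conditioned on a seed context, and a stronger judge model scores the result along several quality dimensions. This is an illustrative assumption of how such a loop could be wired up against the OpenAI chat completions API; the model names, turn count, prompts, and rubric dimensions are guesses, not the paper's actual configuration.

```python
# Minimal sketch of a generate-then-judge dialogue benchmark loop.
# All model names, prompts, and rubric dimensions are illustrative
# assumptions; they do not come from the MEDAL paper.
from openai import OpenAI

client = OpenAI()

def chat(model: str, system: str, messages: list[dict]) -> str:
    """Single chat-completion call with a system prompt prepended."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system}] + messages,
    )
    return resp.choices[0].message.content

def simulate_dialogue(seed: str, lang: str, turns: int = 4) -> list[dict]:
    """Two LLMs play user and chatbot, conditioned on a seed context."""
    user_sys = f"You are a human user chatting in {lang}. Context: {seed}"
    bot_sys = f"You are a helpful chatbot. Always reply in {lang}."
    history: list[dict] = []
    for _ in range(turns):
        # The user simulator sees the history with the roles flipped,
        # so its own past turns appear as "assistant" messages.
        flipped = [{"role": "assistant" if m["role"] == "user" else "user",
                    "content": m["content"]} for m in history]
        user_msg = chat("gpt-4o-mini", user_sys, flipped or
                        [{"role": "user", "content": "Start the conversation."}])
        history.append({"role": "user", "content": user_msg})
        bot_msg = chat("gpt-4o-mini", bot_sys, history)
        history.append({"role": "assistant", "content": bot_msg})
    return history

# Hypothetical rubric; the paper's actual dimensions may differ.
DIMENSIONS = ["coherence", "empathy", "commonsense", "fluency"]

def judge(dialogue: list[dict]) -> str:
    """LLM-as-a-judge: score the chatbot's turns on each dimension, 1-5."""
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in dialogue)
    rubric = ", ".join(DIMENSIONS)
    return chat("gpt-4.1",
                "You are a strict dialogue evaluator.",
                [{"role": "user", "content":
                  f"Rate the chatbot on {rubric} (1-5 each), one line per "
                  f"dimension, then list any flaws you notice.\n\n{transcript}"}])

if __name__ == "__main__":
    dlg = simulate_dialogue(seed="planning a trip to Lisbon", lang="Portuguese")
    print(judge(dlg))
```

In a full benchmark-construction run, the judge's flaw annotations (rather than a human's) would be collected at scale first, with human annotators then providing the fine-grained gold labels on a curated subset, as the abstract describes.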