🤖 AI Summary
Existing dialogue evaluation benchmarks are static, quickly outdated, and thin in multilingual coverage, which limits their ability to capture cross-linguistic and cross-cultural nuances and compromises the reliability of LLM assessment. To address this, the authors propose MEDAL, the first meta-evaluation framework for constructing multilingual dialogue evaluation benchmarks that explicitly target dialogue assessment capability. MEDAL is a closed-loop pipeline combining multi-agent dialogue generation by several collaborating LLMs, multidimensional automated analysis with GPT-4.1 as a judge, and fine-grained human annotation, supported by cross-lingual prompt engineering. The resulting benchmark markedly improves linguistic diversity and cultural representativeness. Empirical analysis reveals systematic weaknesses of mainstream LLMs in identifying empathy- and reasoning-related dialogue flaws, and shows that MEDAL effectively discriminates between reasoning-capable and non-reasoning LLMs in evaluation performance.
📝 Abstract
As the capabilities of chatbots and their underlying LLMs continue to improve dramatically, evaluating their performance has increasingly become a major blocker to further development. A key challenge lies in the available benchmarking datasets, which are largely static, outdated, and lacking in multilingual coverage, limiting their ability to capture subtle linguistic and cultural variations. This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate multilingual user-chatbot dialogues, conditioned on varied seed contexts. A strong LLM (GPT-4.1) is then used for a multidimensional analysis of chatbot performance, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new multilingual meta-evaluation benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. We find that current LLMs struggle to detect nuanced issues, particularly those involving empathy and reasoning.
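To make the described pipeline concrete, below is a minimal Python sketch of the generate-then-judge loop: two LLMs role-play a user and a chatbot conditioned on a seed context, and a stronger judge model scores the result along several quality dimensions. This is an illustrative assumption of how such a loop could be wired up against the OpenAI chat completions API; the model names, turn count, prompts, and rubric dimensions are guesses, not the paper's actual configuration.

```python
# Minimal sketch of a generate-then-judge dialogue benchmark loop.
# All model names, prompts, and rubric dimensions are illustrative
# assumptions; they do not come from the MEDAL paper.
from openai import OpenAI

client = OpenAI()

def chat(model: str, system: str, messages: list[dict]) -> str:
    """Single chat-completion call with a system prompt prepended."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system}] + messages,
    )
    return resp.choices[0].message.content

def simulate_dialogue(seed: str, lang: str, turns: int = 4) -> list[dict]:
    """Two LLMs play user and chatbot, conditioned on a seed context."""
    user_sys = f"You are a human user chatting in {lang}. Context: {seed}"
    bot_sys = f"You are a helpful chatbot. Always reply in {lang}."
    history: list[dict] = []
    for _ in range(turns):
        # The user simulator sees the history with the roles flipped,
        # so its own past turns appear as "assistant" messages.
        flipped = [{"role": "assistant" if m["role"] == "user" else "user",
                    "content": m["content"]} for m in history]
        user_msg = chat("gpt-4o-mini", user_sys, flipped or
                        [{"role": "user", "content": "Start the conversation."}])
        history.append({"role": "user", "content": user_msg})
        bot_msg = chat("gpt-4o-mini", bot_sys, history)
        history.append({"role": "assistant", "content": bot_msg})
    return history

# Hypothetical rubric; the paper's actual dimensions may differ.
DIMENSIONS = ["coherence", "empathy", "commonsense", "fluency"]

def judge(dialogue: list[dict]) -> str:
    """LLM-as-a-judge: score the chatbot's turns on each dimension, 1-5."""
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in dialogue)
    rubric = ", ".join(DIMENSIONS)
    return chat("gpt-4.1",
                "You are a strict dialogue evaluator.",
                [{"role": "user", "content":
                  f"Rate the chatbot on {rubric} (1-5 each), one line per "
                  f"dimension, then list any flaws you notice.\n\n{transcript}"}])

if __name__ == "__main__":
    dlg = simulate_dialogue(seed="planning a trip to Lisbon", lang="Portuguese")
    print(judge(dlg))
```

In a full benchmark-construction run, the judge's flaw annotations (rather than a human's) would be collected at scale first, with human annotators then providing the fine-grained gold labels on a curated subset, as the abstract describes.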