MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

📅 2025-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses four core challenges that large language models (LLMs) face in realistic multi-turn human–AI dialogues: instruction retention, inference memory of user information, reliable versioned editing, and self-coherence. To this end, we introduce MultiChallenge, a multi-turn dialogue benchmark explicitly designed for real-world scenarios. Our evaluation framework integrates curated human-authored dialogue trajectories, instance-level rubrics, LLM-as-judge automated assessment, and statistical calibration, achieving high agreement with human judgments (Spearman’s ρ > 0.92). Empirical results reveal severe limitations: all frontier models score below 50% average accuracy on this benchmark, and the strongest performer, Claude 3.5 Sonnet (June 2024), reaches only 41.4%. These findings expose fundamental bottlenecks in current LLMs’ capacity for complex, sustained dialogue.

📝 Abstract
We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic among current human-LLM interactions, but are also challenging to all current frontier LLMs. All four challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop an LLM-as-judge with instance-level rubrics, enabling an automatic evaluation method that shows fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (June 2024) achieving just a 41.4% average accuracy.
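
As a rough illustration of how instance-level rubric judging could roll up into an average accuracy, the sketch below scores each conversation with a pass/fail judge and then averages the per-category accuracies. The field names, the `toy_judge` keyword check, and the unweighted mean over categories are all assumptions made for illustration; the abstract does not specify the aggregation, and a real setup would call an LLM judge rather than a string match.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def average_accuracy(
    examples: List[Dict[str, str]],
    judge: Callable[[str, str], bool],
) -> Tuple[Dict[str, float], float]:
    """Return per-category accuracy and the unweighted mean across categories."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ex in examples:
        # The judge sees only the final response and that instance's rubric.
        verdict = judge(ex["response"], ex["rubric"])
        total[ex["category"]] += 1
        passed[ex["category"]] += int(verdict)
    per_category = {c: passed[c] / total[c] for c in total}
    overall = sum(per_category.values()) / len(per_category)
    return per_category, overall


def toy_judge(response: str, rubric: str) -> bool:
    # Hypothetical stand-in for an LLM judge: passes if the rubric phrase appears.
    return rubric in response.lower()


# Made-up examples; category labels "a" and "b" are placeholders.
EXAMPLES = [
    {"category": "a", "response": "Sure, no nuts included.", "rubric": "no nuts"},
    {"category": "a", "response": "Here is a peanut recipe.", "rubric": "no nuts"},
    {"category": "b", "response": "Your meeting is at 3 pm.", "rubric": "3 pm"},
]

per_category, overall = average_accuracy(EXAMPLES, toy_judge)
print(per_category, overall)
```

Averaging per category first keeps a large category from dominating the headline number; pooling all instances would be an equally plausible choice if the categories are similar in size.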
Problem

Research questions and friction points this paper is trying to address.

Advanced Language Models
Multi-turn Conversations
Dialogue Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

MultiChallenge
dialogue assessment
large language models
Ved Sirdeshmukh
Scale AI
Kaustubh Deshpande
Scale AI
Johannes Mols
Scale AI
Lifeng Jin
Scale AI
Computational Linguistics
Ed-Yeremai Cardona
Scale AI
Dean Lee
Scale AI
Jeremy Kritz
Scale AI
Willow Primack
Scale AI
Summer Yue
Scale AI
Chen Xing
Scale AI