MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

📅 2025-05-21
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing LLM evaluation frameworks focus predominantly on single-turn reasoning and lack systematic assessment of multi-turn interactive reasoning. Method: We introduce MTR-Bench, the first comprehensive benchmark for multi-turn reasoning, comprising 4 task classes, 40 tasks, and 3,600 instances, with an emphasis on environment interaction and fine-grained difficulty stratification. We design a fully automated multi-turn evaluation framework that integrates programmatically generated prompts, simulated interactive environments, structured answer verification, and scalable scoring protocols, enabling end-to-end automation of both data construction and evaluation. Contribution/Results: Empirical evaluation reveals substantial performance degradation of state-of-the-art reasoning models on multi-turn tasks. MTR-Bench fills a critical gap in interactive reasoning evaluation, providing a reproducible benchmark and actionable insights for model diagnostics and next-generation interactive AI research.
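
To make the described pipeline concrete, here is a minimal Python sketch of such a multi-turn evaluation loop, written against a generic environment interface. Every name in it (`Environment`, `query_model`, the `FINAL:` answer convention) is an illustrative assumption, not MTR-Bench's actual API.

```python
# Minimal sketch of a multi-turn evaluation loop like the one described
# above. All names (Environment, query_model, "FINAL:") are illustrative
# assumptions, not MTR-Bench's actual API.
from dataclasses import dataclass, field

@dataclass
class Environment:
    """One benchmark instance: a generated prompt plus a verifiable answer."""
    prompt: str
    answer: str
    max_turns: int = 10
    history: list = field(default_factory=list)

    def step(self, action: str) -> str:
        """Return simulated feedback for the model's latest action."""
        self.history.append(action)
        # A real environment would compute task-specific feedback here.
        return f"Observation after turn {len(self.history)}: ..."

def evaluate_instance(env: Environment, query_model) -> bool:
    """Run one instance end to end; score via structured answer verification."""
    observation = env.prompt
    for _ in range(env.max_turns):
        action = query_model(env.history, observation)
        if action.startswith("FINAL:"):  # model commits to an answer
            return action.removeprefix("FINAL:").strip() == env.answer
        observation = env.step(action)   # otherwise, keep interacting
    return False                         # exhausting the turn budget counts as a miss
```

Because the prompt, the environment feedback, and the final verification are all programmatic, a loop of this shape runs end to end with no human in it, which is what makes the scalable scoring the summary describes possible.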

📝 Abstract
Recent advances in Large Language Models (LLMs) have shown promising results on complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this gap to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for evaluating LLMs' Multi-Turn Reasoning. Comprising 4 classes, 40 tasks, and 3,600 instances, MTR-Bench covers diverse reasoning capabilities with fine-grained difficulty granularity and necessitates multi-turn interaction with the environments. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks, and further analysis of these results yields valuable insights for future research on interactive AI systems.
Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive datasets for multi-turn reasoning evaluation
Absence of scalable automatic evaluation protocols for interactive tasks
Current LLMs underperform in multi-turn interactive reasoning scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive multi-turn reasoning benchmark spanning 4 task classes and 40 tasks
Fully-automated framework for both dataset construction and evaluation
Diverse tasks with fine-grained difficulty stratification (see the sketch below)
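
On the construction side, fine-grained difficulty combined with full automation suggests generating instances from a parameterized template. The sketch below shows one hypothetical way to do this in Python; the toy task (number guessing) and every name in it are assumptions for illustration, not taken from MTR-Bench.

```python
# Hypothetical sketch of programmatic instance generation with a
# difficulty knob; the task and parameters are illustrative only,
# not MTR-Bench's actual generators.
import random

def generate_instances(n: int, difficulty: int, seed: int = 0) -> list[dict]:
    """Generate n instances; a larger `difficulty` widens the search space,
    so solving requires more interaction turns."""
    rng = random.Random(seed)
    upper = 10 ** difficulty  # e.g. difficulty 1 -> range [1, 10]
    return [
        {
            "prompt": f"Guess the hidden number between 1 and {upper}. "
                      "Each turn you will be told 'higher' or 'lower'.",
            "answer": str(rng.randint(1, upper)),
            "difficulty": difficulty,
        }
        for _ in range(n)
    ]

# Example: three difficulty tiers, built with no human annotation.
dataset = [inst for d in (1, 2, 3) for inst in generate_instances(30, d)]
```

Sweeping the difficulty parameter in this way yields the fine-grained stratification the summary mentions while keeping dataset construction fully automatic.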