🤖 AI Summary
Problem: Existing LLM evaluation frameworks focus predominantly on single-turn reasoning and lack a systematic assessment of multi-turn, interactive reasoning. Method: We introduce MTR-Bench, the first comprehensive benchmark for multi-turn reasoning, comprising 4 task classes, 40 tasks, and 3,600 instances, with an emphasis on environment interaction and fine-grained difficulty stratification. We design a fully automated multi-turn evaluation framework that integrates programmatically generated prompts, simulated interactive environments, structured answer verification, and scalable scoring protocols, so that both data construction and evaluation run end to end without manual steps. Contribution/Results: Empirical evaluation reveals substantial performance degradation of state-of-the-art reasoning models on multi-turn tasks. MTR-Bench fills a critical gap in interactive reasoning evaluation, providing a reproducible benchmark and actionable insights for model diagnostics and next-generation interactive AI research.
📝 Abstract
Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this gap to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill it, we present MTR-Bench for evaluating LLMs' Multi-Turn Reasoning. Comprising 4 classes, 40 tasks, and 3,600 instances, MTR-Bench covers diverse reasoning capabilities, offers fine-grained difficulty levels, and requires multi-turn interaction with environments. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks. Further analysis of these results yields valuable insights for future research in interactive AI systems.
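To make the idea of an automated multi-turn evaluation loop concrete, the following is a minimal sketch and not MTR-Bench's actual code. It assumes a toy number-guessing task in which difficulty controls the size of the search range, and it uses a binary-search function as a stand-in for a model. All names (`GuessNumberEnv`, `make_instance`, `run_episode`, `evaluate`, `binary_search_policy`) are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# One turn of history: (guess, feedback from the environment).
History = List[Tuple[int, str]]
# Hypothetical policy interface: (low, high, history) -> next guess.
Policy = Callable[[int, int, History], int]


@dataclass
class GuessNumberEnv:
    """Toy interactive task: find a hidden integer in [low, high]."""
    low: int
    high: int
    secret: int

    def step(self, guess: int) -> Tuple[str, bool]:
        """Verify one answer and return (feedback, solved)."""
        if guess == self.secret:
            return "correct", True
        return ("higher" if guess < self.secret else "lower"), False


def make_instance(difficulty: int, rng: random.Random) -> GuessNumberEnv:
    """Programmatically generate an instance; range grows with difficulty."""
    high = 10 ** difficulty
    return GuessNumberEnv(1, high, rng.randint(1, high))


def binary_search_policy(low: int, high: int, history: History) -> int:
    """Stand-in for a model: narrow the bounds using past feedback."""
    for guess, feedback in history:
        if feedback == "higher":
            low = max(low, guess + 1)
        elif feedback == "lower":
            high = min(high, guess - 1)
    return (low + high) // 2


def run_episode(env: GuessNumberEnv, policy: Policy, max_turns: int) -> bool:
    """Run one multi-turn interaction; True if solved within the budget."""
    history: History = []
    for _ in range(max_turns):
        guess = policy(env.low, env.high, history)
        feedback, solved = env.step(guess)
        if solved:
            return True
        history.append((guess, feedback))
    return False


def evaluate(policy: Policy, difficulty: int, n_instances: int,
             max_turns: int, seed: int = 0) -> float:
    """Score a policy as the fraction of generated instances it solves."""
    rng = random.Random(seed)
    solved = sum(
        run_episode(make_instance(difficulty, rng), policy, max_turns)
        for _ in range(n_instances)
    )
    return solved / n_instances


if __name__ == "__main__":
    print(evaluate(binary_search_policy, difficulty=2, n_instances=50, max_turns=7))
```

In this sketch the generator, the environment, the verifier, and the scorer are all plain functions, so no human step is needed between creating instances and reporting a score. Swapping `binary_search_policy` for a function that calls a model would leave the rest of the loop unchanged.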