TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation benchmarks focus predominantly on single-turn tasks and therefore fail to adequately assess multi-turn, iterative reasoning. This work introduces a code-breaking benchmark explicitly designed for multi-turn interaction and multi-step reasoning. Its dual-mechanism design, combining feedback-driven interaction with hidden-rule constraints, supports two difficulty modes (Classic and Nightmare) and provides ground-truth annotations of intermediate reasoning steps; hiding the task rules also mitigates data contamination. Inspired by the "Turing Machine" board game, the benchmark integrates structured feedback modeling, multi-turn state tracking, and fine-grained reasoning-chain evaluation. Experiments reveal stark performance gaps: the best state-of-the-art LLM achieves only 81.5% accuracy in Classic mode and drops to 17.8% in Nightmare mode, whereas human participants attain 100% in both, exposing fundamental limitations in LLMs' long-horizon, consistent reasoning.

📝 Abstract
Despite impressive advances in large language models (LLMs), existing benchmarks often focus on single-turn or single-step tasks, failing to capture the kind of iterative reasoning required in real-world settings. To address this limitation, we introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task inspired by a "Turing Machine Board Game." In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds. This dynamic setup requires models to reason over time, adapt based on past information, and maintain consistency across steps, capabilities underexplored in current benchmarks. TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains. To support fine-grained analysis, we provide ground-truth annotations for intermediate reasoning steps. Our evaluation of state-of-the-art LLMs reveals significant gaps: the best model achieves 81.5% accuracy in Classic mode, but performance drops to 17.8% in Nightmare mode. In contrast, human participants achieve 100% in both, underscoring the challenge TurnBench poses to current models. By incorporating feedback loops and hiding task rules, TurnBench reduces contamination risks and provides a rigorous testbed for diagnosing and advancing multi-step, multi-turn reasoning in LLMs.
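The episode loop the abstract describes, guess, receive structured feedback, and narrow the space of hidden rules across rounds, can be sketched in miniature. The rule pool, digit range, and greedy guess selection below are illustrative assumptions, not TurnBench's actual rules or feedback format:

```python
from itertools import product

# Hypothetical pool of candidate rules over a 3-digit code (digits 1-5).
# These names and predicates are assumptions for illustration only.
RULE_POOL = {
    "ascending": lambda c: c[0] < c[1] < c[2],
    "even_sum": lambda c: sum(c) % 2 == 0,
    "contains_3": lambda c: 3 in c,
    "all_distinct": lambda c: len(set(c)) == 3,
}

CODES = list(product(range(1, 6), repeat=3))  # all candidate codes

def play(secret_name, max_turns=20):
    """Multi-turn loop: each round, submit a guess, receive per-rule
    feedback (does the guess satisfy the hidden rule?), and keep only
    the rule hypotheses consistent with that feedback."""
    secret = RULE_POOL[secret_name]
    hypotheses = set(RULE_POOL)
    for turn in range(1, max_turns + 1):
        # Greedily pick the guess that best splits remaining hypotheses.
        guess = max(CODES, key=lambda c: min(
            sum(RULE_POOL[h](c) for h in hypotheses),
            sum(not RULE_POOL[h](c) for h in hypotheses)))
        verdict = secret(guess)  # structured feedback from the environment
        hypotheses = {h for h in hypotheses
                      if RULE_POOL[h](guess) == verdict}
        if len(hypotheses) == 1:
            return hypotheses.pop(), turn
    return None, max_turns
```

The point of the sketch is the state the solver must carry between turns: the hypothesis set is exactly the "integrating clues across multiple rounds" that single-turn benchmarks never exercise.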
Problem

Research questions and friction points this paper is trying to address.

Evaluating multi-turn, multi-step reasoning in LLMs
Addressing lack of iterative reasoning benchmarks
Testing model adaptability and consistency over rounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive code-breaking task for reasoning
Multi-turn feedback for adaptive learning
Dual modes to test varying complexity