Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs

📅 2025-08-13

📈 Citations: 0

✨ Influential: 0

career value

176K/year

🤖 AI Summary

Large language models (LLMs) underperform in real-world scenarios due to incomplete information, the need for multi-turn interaction, and dynamic planning. Method: We introduce the first multi-turn interactive benchmark explicitly designed to evaluate logical consistency, proactive information acquisition, and strategic dialogue. It features a novel structured puzzle framework with deterministic, automated scoring—eliminating reliance on human annotation. Contribution/Results: The benchmark systematically stress-tests core capabilities: reasoning, instruction following, dialogue management, and active questioning. Evaluation across mainstream LLMs reveals critical deficiencies: 42% failure rate in planning, 38% deviation in instruction adherence, and 51% error rate in multi-step reasoning. Our work establishes a reproducible, scalable evaluation paradigm that precisely identifies key bottlenecks, providing empirical grounding and concrete guidance for designing and optimizing next-generation interactive reasoning models.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in interactive reasoning and strategic dialogue

Assessing LLMs' ability to handle incomplete data and seek information

Benchmarking multi-turn tasks for reasoning and dialogue skills

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel benchmark for multi-turn reasoning tasks

Deterministic scoring without human intervention

Evaluates instruction following and planning failures

🔎 Similar Papers

RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues