🤖 AI Summary
Large language models (LLMs) underperform in real-world scenarios due to incomplete information, the need for multi-turn interaction, and dynamic planning. Method: We introduce the first multi-turn interactive benchmark explicitly designed to evaluate logical consistency, proactive information acquisition, and strategic dialogue. It features a novel structured puzzle framework with deterministic, automated scoring—eliminating reliance on human annotation. Contribution/Results: The benchmark systematically stress-tests core capabilities: reasoning, instruction following, dialogue management, and active questioning. Evaluation across mainstream LLMs reveals critical deficiencies: 42% failure rate in planning, 38% deviation in instruction adherence, and 51% error rate in multi-step reasoning. Our work establishes a reproducible, scalable evaluation paradigm that precisely identifies key bottlenecks, providing empirical grounding and concrete guidance for designing and optimizing next-generation interactive reasoning models.
📝 Abstract
Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.