🤖 AI Summary
Existing automated evaluation methods are severely distorted on multi-turn incomplete-information lateral reasoning tasks: they are vulnerable to shortcut learning, pattern rigidity, and premature termination, and consequently misjudge LLMs' true reasoning capabilities. To address this, we propose the first dedicated evaluation framework for such scenarios, integrating reasoning-path auditing, multidimensional dynamic metrics (including cognitive alignment), and human–model comparative analysis, supported by expert annotation and real-time path tracing. We are the first to systematically identify and attribute the "evaluation hallucination" phenomenon, i.e., the significant divergence between automated scores and human cognitive judgments. Experiments across diverse lateral reasoning benchmarks confirm that this hallucination is pervasive and show that the framework substantially improves evaluation fidelity, precisely localizes model reasoning failures, and provides both theoretical foundations and practical tools for reliable assessment and model refinement.
📝 Abstract
Multi-round incomplete-information tasks are crucial for evaluating the lateral thinking capabilities of large language models (LLMs). Current research primarily relies on multiple benchmarks and automated evaluation metrics to assess these abilities. However, our study reveals previously overlooked limitations of these methods: they often yield misleading results that fail to uncover key issues such as shortcut-taking behaviors, rigid reasoning patterns, and premature task termination. These issues obscure the true reasoning capabilities of LLMs and undermine the reliability of evaluations. To address these limitations, we propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.
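To make the setting concrete, below is a minimal sketch of how a multi-turn incomplete-information evaluation loop with reasoning-path inspection might be structured. It is illustrative only: the puzzle format, the helper names (`run_episode`, `audit_reasoning_path`), and the toy metrics (premature termination, repeated questions, informative ratio) are assumptions for exposition, not the paper's actual framework or API.

```python
"""Hedged sketch of a multi-turn incomplete-information evaluation loop.

Assumptions (not from the paper): the yes/no puzzle format, the function
names, and the path-level metrics below are hypothetical stand-ins for the
proposed reasoning-path inspection and diversified metrics.
"""
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    question: str   # model's yes/no probe about the hidden scenario
    answer: str     # host reply: "yes", "no", or "irrelevant"


@dataclass
class Episode:
    puzzle: str                        # surface story shown to the model
    solution: str                      # hidden ground-truth scenario
    turns: List[Turn] = field(default_factory=list)
    final_guess: str = ""


def run_episode(ask_model: Callable[[str, List[Turn]], str],
                answer_host: Callable[[str, str], str],
                puzzle: str, solution: str, max_turns: int = 20) -> Episode:
    """Drive the multi-turn interaction and log the full reasoning path."""
    ep = Episode(puzzle=puzzle, solution=solution)
    for _ in range(max_turns):
        q = ask_model(puzzle, ep.turns)
        if q.lower().startswith("final answer:"):
            ep.final_guess = q.split(":", 1)[1].strip()
            break
        ep.turns.append(Turn(question=q, answer=answer_host(q, solution)))
    return ep


def audit_reasoning_path(ep: Episode) -> dict:
    """Toy path-level metrics; crude proxies for the issues named above."""
    n = len(ep.turns)
    informative = sum(t.answer != "irrelevant" for t in ep.turns)
    return {
        "turns_used": n,
        # Premature termination: guessing with little accumulated evidence.
        "premature_termination": bool(ep.final_guess) and n < 3,
        # Shortcut/rigidity proxy: near-duplicate probes repeated verbatim.
        "repeated_questions": n - len({t.question.lower() for t in ep.turns}),
        # Simple information-gain proxy in place of richer dynamic metrics.
        "informative_ratio": informative / n if n else 0.0,
    }
```

In this sketch, the audit dictionary produced per episode could be compared against traces from human solvers on the same puzzles, which is one plausible way to operationalize the human–model comparative analysis described above.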