Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automated evaluation methods suffer severe distortion in multi-turn incomplete-information lateral reasoning tasks: they are vulnerable to shortcut learning, pattern rigidity, and premature termination, and thus misjudge LLMs' true reasoning capabilities. To address this, we propose the first dedicated evaluation framework for such scenarios, integrating reasoning-path auditing, multidimensional dynamic metrics (including cognitive alignment), and human-model comparative analysis, augmented by expert annotation and real-time path tracing. We are the first to systematically identify and attribute the "evaluation hallucination" phenomenon: a significant divergence between automated scores and human cognitive judgments. Experiments across diverse lateral reasoning benchmarks confirm that this hallucination is widespread, and show that the proposed framework substantially improves evaluation fidelity, precisely localizes model reasoning failures, and provides both theoretical foundations and practical tools for reliable assessment and model refinement.

📝 Abstract
Multi-round incomplete-information tasks are crucial for evaluating the lateral thinking capabilities of large language models (LLMs). Current research primarily relies on multiple benchmarks and automated evaluation metrics to assess these abilities. However, our study reveals limitations of existing methods: they often yield misleading results that fail to uncover key issues, such as shortcut-taking behaviors, rigid reasoning patterns, and premature task termination. These issues obscure the true reasoning capabilities of LLMs and undermine the reliability of evaluations. To address these limitations, we propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.
Problem

Research questions and friction points this paper is trying to address.

Evaluating lateral thinking in LLMs with incomplete information
Identifying misleading results from current evaluation methods
Proposing refined standards for reliable LLM assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Refined evaluation standards for reasoning paths
Diversified metrics for comprehensive assessment
Comparative analyses with human performance benchmarks
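The human-model comparative analysis above can be illustrated with a minimal sketch: if an automated metric and human judges score the same reasoning transcripts, low rank agreement between the two score lists is one concrete signal of "evaluation hallucination" (e.g., an automated metric rewarding a shortcut-taking model that humans rate poorly). All function names, thresholds, and data below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): flag potential
# "evaluation hallucination" when automated scores diverge in rank
# from human judgments on the same reasoning transcripts.

def ranks(xs):
    """Return 1-based average ranks of xs, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a run of tied values
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman rank correlation of two equal-length score lists."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def hallucination_flag(auto_scores, human_scores, threshold=0.5):
    """Flag evaluation hallucination when rank agreement is low.

    threshold is an arbitrary illustrative cutoff, not from the paper.
    """
    rho = spearman(auto_scores, human_scores)
    return rho, rho < threshold

# Toy example: the automated metric ranks a shortcut-taking model
# highly while human judges rank it low, so agreement collapses.
auto  = [0.9, 0.8, 0.7, 0.4, 0.3]   # automated scores per transcript
human = [0.2, 0.9, 0.3, 0.8, 0.7]   # human judgments per transcript
rho, flagged = hallucination_flag(auto, human)
# rho = -0.3, flagged = True
```

In a real audit, the score lists would come from the benchmark's automated evaluator and from expert annotators inspecting the same reasoning paths; the sketch only shows the comparison step.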