🤖 AI Summary
Existing automated evaluation methods are severely distorted on multi-turn incomplete-information lateral reasoning tasks: they are vulnerable to shortcut learning, pattern rigidity, and premature termination, and consequently misjudge LLMs' true reasoning capabilities. To address this, we propose the first dedicated evaluation framework for such scenarios, integrating reasoning-path auditing, multidimensional dynamic metrics (including cognitive alignment), and human–model comparative analysis, supported by expert annotation and real-time path tracing. We are the first to systematically identify and attribute the "evaluation hallucination" phenomenon, i.e., the significant divergence between automated scores and human cognitive judgments. Experiments across diverse lateral reasoning benchmarks confirm that this hallucination is pervasive and show that the framework substantially improves evaluation fidelity, precisely localizes model reasoning failures, and provides both theoretical foundations and practical tools for reliable assessment and model refinement.
📝 Abstract
Multi-round incomplete-information tasks are crucial for evaluating the lateral thinking capabilities of large language models (LLMs). Current research primarily relies on multiple benchmarks and automated evaluation metrics to assess these abilities. However, our study reveals previously overlooked limitations of these methods: they often yield misleading results that fail to uncover key issues such as shortcut-taking behaviors, rigid reasoning patterns, and premature task termination. These issues obscure the true reasoning capabilities of LLMs and undermine the reliability of evaluations. To address these limitations, we propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.
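To make the setting concrete, below is a minimal sketch of how a multi-turn incomplete-information evaluation loop with reasoning-path inspection might be structured. It is illustrative only: the puzzle format, the helper names (`run_episode`, `audit_reasoning_path`), and the toy metrics (premature termination, repeated questions, informative ratio) are assumptions for exposition, not the paper's actual framework or API.

```python
"""Hedged sketch of a multi-turn incomplete-information evaluation loop.

Assumptions (not from the paper): the yes/no puzzle format, the function
names, and the path-level metrics below are hypothetical stand-ins for the
proposed reasoning-path inspection and diversified metrics.
"""
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    question: str   # model's yes/no probe about the hidden scenario
    answer: str     # host reply: "yes", "no", or "irrelevant"


@dataclass
class Episode:
    puzzle: str                        # surface story shown to the model
    solution: str                      # hidden ground-truth scenario
    turns: List[Turn] = field(default_factory=list)
    final_guess: str = ""


def run_episode(ask_model: Callable[[str, List[Turn]], str],
                answer_host: Callable[[str, str], str],
                puzzle: str, solution: str, max_turns: int = 20) -> Episode:
    """Drive the multi-turn interaction and log the full reasoning path."""
    ep = Episode(puzzle=puzzle, solution=solution)
    for _ in range(max_turns):
        q = ask_model(puzzle, ep.turns)
        if q.lower().startswith("final answer:"):
            ep.final_guess = q.split(":", 1)[1].strip()
            break
        ep.turns.append(Turn(question=q, answer=answer_host(q, solution)))
    return ep


def audit_reasoning_path(ep: Episode) -> dict:
    """Toy path-level metrics; crude proxies for the issues named above."""
    n = len(ep.turns)
    informative = sum(t.answer != "irrelevant" for t in ep.turns)
    return {
        "turns_used": n,
        # Premature termination: guessing with little accumulated evidence.
        "premature_termination": bool(ep.final_guess) and n < 3,
        # Shortcut/rigidity proxy: near-duplicate probes repeated verbatim.
        "repeated_questions": n - len({t.question.lower() for t in ep.turns}),
        # Simple information-gain proxy in place of richer dynamic metrics.
        "informative_ratio": informative / n if n else 0.0,
    }
```

In this sketch, the audit dictionary produced per episode could be compared against traces from human solvers on the same puzzles, which is one plausible way to operationalize the human–model comparative analysis described above.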