WakenLLM: A Fine-Grained Benchmark for Evaluating LLM Reasoning Potential and Reasoning Process Stability

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the “vague perception” problem arising from frequent “Unknown” outputs by large language models (LLMs): the difficulty of distinguishing *inherently ill-posed questions* from *model capability limitations*. It proposes the first fine-grained evaluation framework for this issue. Methodologically, the authors decouple inputs that are genuinely *undecidable* from those that are solvable but unresolved by the model, design multi-strategy prompting stimuli (including chain-of-thought reconstruction and uncertainty probing), and integrate theoretical-accuracy analysis with response attribution to assess both reasoning potential and process stability. The contributions are threefold: (1) a principled, interpretable attribution mechanism for “Unknown” responses; (2) empirical evidence that a non-negligible fraction of “Unknown” outputs conceal latent correct answers amenable to elicitation; and (3) systematic benchmarking across multiple datasets, revealing concrete reasoning boundaries and improvement margins of state-of-the-art LLMs, thereby establishing a new paradigm for evaluating model honesty and reasoning capacity.
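To make the attribution mechanism concrete, here is a minimal sketch of how an initial “Unknown” response might be re-tested under stimulation and attributed to a cause. The `query_model` callable, the prompt templates, and the label names are illustrative assumptions, not the paper's actual implementation; the real stimuli and label set may differ.

```python
# Minimal sketch of Unknown-response attribution (illustrative only).
# `query_model` maps a prompt string to the model's final answer string.

STIMULI = {
    # Ask the model to rebuild its reasoning chain before answering.
    "cot_reconstruction": (
        "Reconstruct your reasoning step by step, then give a final answer:\n{q}"
    ),
    # Probe what information the model believes is missing.
    "uncertainty_probe": (
        "State what would be needed to decide this question, then answer "
        "it if it is in fact decidable:\n{q}"
    ),
}

def attribute_unknown(item, query_model):
    """Attribute one initial 'Unknown' response to a finer-grained cause.

    item: dict with 'question' and 'gold' (the reference label; a gold
    label of 'Unknown' marks a genuinely indeterminate input).
    """
    for name, template in STIMULI.items():
        answer = query_model(template.format(q=item["question"]))
        if answer == item["gold"] != "Unknown":
            # A correct answer was elicitable: the Unknown hid latent knowledge.
            return "latent_known", name
    if item["gold"] == "Unknown":
        # The input itself is indeterminate, so the Unknown was honest.
        return "genuinely_indeterminate", None
    # A solvable input that no stimulus could recover: model incapacity.
    return "incapacity", None
```

Recording which stimulus elicited each recovered answer keeps the attribution interpretable: the framework can report not only that an “Unknown” concealed a correct answer, but which kind of prompting brought it out.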

📝 Abstract
Large Language Models (LLMs) frequently output the label *Unknown*, yet current evaluations focus almost exclusively on whether such answers are *honest* rather than why they arise. This blurs two distinct cases: (i) an input that is genuinely indeterminate and (ii) a solvable problem that the model fails to resolve. We call this phenomenon *Vague Perception*. We therefore introduce a framework that quantifies the proportion of *Unknown* responses attributable to model incapacity and tests whether guided stimulation can convert them into either correct (*Known*) or intrinsically indeterminate outcomes. By separating these sources of uncertainty, our method provides a clearer picture of LLM reasoning limits and their potential for improvement. After deriving a theoretical accuracy for each reasoning task on different LLMs, we apply several methods to test whether each model can reach that accuracy under a baseline framework. Our work explores the true reasoning ability of LLMs and offers a new perspective on resolving the *Vague Perception* phenomenon.
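One plausible way to operationalize the quantities the abstract mentions (the share of incapacity-driven *Unknown* responses and a theoretical accuracy the model could reach) is sketched below. The record fields and the aggregation rule are assumptions for illustration, not the paper's exact definitions.

```python
def vague_perception_stats(records):
    """Aggregate per-item attributions into the proportions discussed above.

    Each record is assumed to carry: 'initial' (the model's first answer),
    'gold' (the reference label), and 'attribution' (set for items that
    were initially answered 'Unknown', e.g. by attribute_unknown above).
    Assumes a non-empty list of records.
    """
    n = len(records)
    unknowns = [r for r in records if r["initial"] == "Unknown"]
    recovered = sum(r["attribution"] == "latent_known" for r in unknowns)
    incapacity = sum(r["attribution"] == "incapacity" for r in unknowns)
    baseline_correct = sum(r["initial"] == r["gold"] for r in records)
    return {
        "baseline_accuracy": baseline_correct / n,
        # Upper bound if every latent-known Unknown were converted to correct.
        "theoretical_accuracy": (baseline_correct + recovered) / n,
        # Share of Unknowns attributable to model limitations.
        "unknown_incapacity_rate": incapacity / max(len(unknowns), 1),
    }
```

Under this reading, the gap between `theoretical_accuracy` and `baseline_accuracy` is the improvement margin that guided stimulation could, at best, realize.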
Problem

Research questions and friction points this paper is trying to address.

Distinguishes genuinely indeterminate inputs from model incapacity in LLMs
Quantifies the share of Unknown responses caused by model limitations rather than unsolvable inputs
Evaluates LLM reasoning potential and stability through guided stimulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantifies Unknown responses attributable to model incapacity
Tests whether guided stimulation can convert Unknown into Known
Separates sources of uncertainty to map LLM reasoning limits more clearly