🤖 AI Summary
This work investigates whether large language models (LLMs) exhibit cognitive dissonance: systematic inconsistency between their stated answers to multiple-choice questions (MCQs) and their revealed beliefs, operationalized as next-token probability distributions over answer options.
Method: We propose a novel evaluation framework grounded in raw text completion, distinguishing and quantifying the stated answer versus revealed belief through multi-outcome scenario design, token-level probability analysis, and controlled prompt ablation experiments.
Contribution/Results: We find that LLMs frequently select correct answers while simultaneously exhibiting biases in their probability distributions, including causal misattribution, miscalibrated uncertainty estimation, and sluggish evidence updating, thereby exposing critical reasoning flaws obscured by conventional MCQ accuracy metrics. Our findings challenge the reliability of unstructured generative outputs as proxies for robust reasoning and provide both theoretical grounding and empirical evidence for developing more trustworthy, belief-aware evaluation paradigms for LLMs.
📝 Abstract
Prompting and multiple-choice questions (MCQ) have become the preferred approach to assessing the capabilities of Large Language Models (LLMs), due to their ease of manipulation and evaluation. Such experimental appraisals have pointed toward the LLMs' apparent ability to perform causal reasoning or to grasp uncertainty. In this paper, we investigate whether these abilities are measurable outside of tailored prompting and MCQ by reformulating these issues as direct text completion, the foundation of LLMs. To achieve this goal, we define scenarios with multiple possible outcomes, and we compare the predictions LLMs make through prompting (their Stated Answer) to the probability distributions they compute over these outcomes during next-token prediction (their Revealed Belief). Our findings suggest that the Revealed Belief of LLMs significantly differs from their Stated Answer and hint at multiple biases and misrepresentations that their beliefs may yield in many scenarios and outcomes. As text completion is at the core of LLMs, these results suggest that common evaluation methods may only provide a partial picture and that more research is needed to assess the extent and nature of LLM capabilities.
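The Stated Answer vs. Revealed Belief contrast can be illustrated with a minimal sketch. The logit values below are hypothetical placeholders, not outputs from any model studied in the paper; in practice the logits would come from a real LLM's next-token distribution restricted to the tokens that begin each outcome.

```python
import math

def revealed_belief(option_logits):
    """Softmax over the answer options only: the model's Revealed Belief,
    i.e. its next-token distribution renormalized over the outcomes."""
    m = max(option_logits.values())  # subtract max for numerical stability
    exps = {o: math.exp(v - m) for o, v in option_logits.items()}
    z = sum(exps.values())
    return {o: e / z for o, e in exps.items()}

# Hypothetical next-token logits for three outcomes of a scenario.
logits = {"A": 2.1, "B": 1.9, "C": -0.5}

belief = revealed_belief(logits)
stated = "A"  # the answer the model emits when prompted as an MCQ

# The stated answer may be "correct" while the belief is nearly split
# between two options -- the kind of dissonance the paper measures.
print(stated, {o: round(p, 3) for o, p in belief.items()})
```

Here the model states "A", yet its belief assigns almost as much mass to "B", so MCQ accuracy alone would hide the near-tie that the Revealed Belief exposes.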