🤖 AI Summary
This study investigates whether large language models (LLMs) can predict the correctness of their own answers before generating any output. The authors train a linear probe on intermediate-layer transformer activations to forecast answer accuracy prior to token emission. Experiments reveal a stable, low-dimensional direction in mid-layer representations that consistently encodes both answer correctness and intrinsic confidence, indicating that a self-assessment capability emerges during the forward pass. This direction generalizes robustly across diverse LLMs and knowledge-intensive benchmarks (e.g., TruthfulQA, FEVER), substantially outperforming black-box baselines and explicit confidence-calibration methods; generalization degrades, however, on mathematical reasoning tasks (e.g., GSM8K). The key contribution is the first empirical demonstration that LLMs internally develop a linearly separable "self-assessment representation," offering a foundational mechanism for trustworthy AI and dynamic inference control.
📝 Abstract
Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes to predict whether the model's forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections onto this "in-advance correctness direction", trained on generic trivia questions, predict success both in distribution and on diverse out-of-distribution knowledge datasets, outperforming black-box baselines and verbalised confidence. Predictive power saturates in intermediate layers, suggesting that self-assessment emerges mid-computation. Notably, generalisation falters on questions requiring mathematical reasoning. Moreover, when models respond "I don't know", doing so correlates strongly with the probe score, indicating that the same direction also captures confidence. By complementing previous results on truthfulness and other behaviours obtained with probes and sparse auto-encoders, our work contributes essential findings towards elucidating LLM internals.
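The probing setup described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the synthetic matrix `X` stands in for mid-layer activations captured after the question is read (in practice these would come from a forward pass through an open-source LLM), and the planted direction `w_true` mimics the low-dimensional correctness signal the paper reports. The probe is an ordinary logistic regression on frozen activations, and its learned weight vector plays the role of the "in-advance correctness direction".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for pre-generation, mid-layer activations:
# n questions, each represented by a d-dimensional hidden state.
n, d = 2000, 64
X = rng.normal(size=(n, d))

# Hypothetical low-dimensional "correctness direction": the label
# (will the forthcoming answer be correct?) depends on one axis plus noise.
w_true = np.zeros(d)
w_true[0] = 2.0
y = (X @ w_true + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Linear probe: a logistic regression trained on the frozen activations.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)

# Projecting a held-out activation onto the learned direction yields a
# scalar score usable as an in-advance confidence estimate.
scores = X_te @ probe.coef_.ravel()
print(f"probe accuracy: {acc:.2f}")
```

In the real setting, one such probe would be trained per layer, and the layer at which held-out accuracy saturates would indicate where in the computation the self-assessment signal emerges.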