Do LLMs Signal When They're Right? Evidence from Neuron Agreement

📅 2025-10-30
🤖 AI Summary
This work investigates whether internal neuron activation patterns in large language models (LLMs) can serve as an unsupervised signal for answer correctness, without relying on external outputs such as token probabilities, entropy, or self-evaluation, which can be poorly calibrated after post-training. The authors propose Neuron Agreement Decoding (NAD), which uses neuron agreement as a reliability signal and is grounded in two empirical observations: correct responses activate fewer unique neurons (sparser activation) and show stronger cross-sample agreement than incorrect ones. NAD selects candidates within a best-of-N framework using activation sparsity and cross-sample agreement, and it supports label-free early stopping based on the first 32 generated tokens. On math and science benchmarks with verifiable answers, NAD matches majority voting; on open-ended coding benchmarks, where majority voting is inapplicable, it consistently outperforms Avg@64. Pruning unpromising trajectories early reduces token usage by 99% with minimal loss in generation quality.

📝 Abstract
Large language models (LLMs) commonly boost reasoning via sample-evaluate-ensemble decoders, achieving label-free gains without ground truth. However, prevailing strategies score candidates using only external outputs such as token probabilities, entropies, or self-evaluations, and these signals can be poorly calibrated after post-training. We instead analyze internal behavior based on neuron activations and uncover three findings: (1) external signals are low-dimensional projections of richer internal dynamics; (2) correct responses activate substantially fewer unique neurons than incorrect ones throughout generation; and (3) activations from correct responses exhibit stronger cross-sample agreement, whereas incorrect ones diverge. Motivated by these observations, we propose Neuron Agreement Decoding (NAD), an unsupervised best-of-N method that selects candidates using activation sparsity and cross-sample neuron agreement, operating solely on internal signals and without requiring comparable textual outputs. NAD enables early correctness prediction within the first 32 generated tokens and supports aggressive early stopping. Across math and science benchmarks with verifiable answers, NAD matches majority voting; on open-ended coding benchmarks where majority voting is inapplicable, NAD consistently outperforms Avg@64. By pruning unpromising trajectories early, NAD reduces token usage by 99% with minimal loss in generation quality, showing that internal signals provide reliable, scalable, and efficient guidance for label-free ensemble decoding.
Problem

Research questions and friction points this paper is trying to address.

Analyzing internal neuron activations to predict LLM correctness
Developing unsupervised decoding using neuron agreement and sparsity
Reducing computational costs while maintaining generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes neuron activations, rather than external outputs, as a signal of answer correctness
Uses activation sparsity and cross-sample agreement for selection
Enables early correctness prediction with aggressive stopping
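
The selection idea in the points above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each candidate response has already been reduced to the set of neuron ids it activated (for example, over its first 32 generated tokens), and it uses Jaccard overlap as a stand-in agreement measure. The function names and the set representation are hypothetical.

```python
def jaccard(a: set[int], b: set[int]) -> float:
    """Overlap between two sets of activated neuron ids (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def agreement_scores(activated: list[set[int]]) -> list[float]:
    """Mean Jaccard similarity of each candidate to every other candidate.

    Assumption: Jaccard overlap approximates "cross-sample neuron agreement";
    the source does not specify the exact measure.
    """
    n = len(activated)
    scores = []
    for i in range(n):
        others = [jaccard(activated[i], activated[j]) for j in range(n) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


def select_by_agreement(activated: list[set[int]]) -> int:
    """Index of the candidate that agrees most with the rest of the samples."""
    scores = agreement_scores(activated)
    return max(range(len(scores)), key=scores.__getitem__)


def select_by_sparsity(activated: list[set[int]]) -> int:
    """Index of the candidate that activated the fewest unique neurons."""
    return min(range(len(activated)), key=lambda i: len(activated[i]))


if __name__ == "__main__":
    # Toy data: three similar candidates and one outlier with many neurons.
    candidates = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9, 10, 11, 12}]
    print(select_by_agreement(candidates), select_by_sparsity(candidates))
```

In this toy input the outlier shares no neurons with the others, so it receives an agreement score of zero and would be the first trajectory pruned under an early-stopping rule. How sparsity and agreement are actually combined is left to the paper.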
Kang Chen
Institute of Trustworthy Embodied AI, Fudan University
Yaoning Wang
Institute of Trustworthy Embodied AI, Fudan University
Kai Xiong
Harbin Institute of Technology
Zhuoka Feng
Institute of Trustworthy Embodied AI, Fudan University
Wenhe Sun
Institute of Trustworthy Embodied AI, Fudan University
Haotian Chen
University of California, Los Angeles
Yixin Cao
Institute of Trustworthy Embodied AI, Fudan University