🤖 AI Summary
This study investigates whether correctness and uncertainty in large language models arise from shared or dissociable internal representations. Using Llama-3.1-8B and Gemma-2-9B, the authors combine a 2×2 analytical framework with sparse autoencoders to identify features in intermediate layers that are associated with correctness and with confidence. They find three functionally distinct feature populations: pure uncertainty features, pure incorrectness features, and confounded features that encode both signals. Feature ablation, AUROC evaluation, and selective abstention experiments validate the roles of these populations. Targeted suppression of the confounded features improves accuracy by 1.1% and reduces output entropy by 75%. Selective abstention using only three confounded features raises accuracy from 62% to 81% at 53% coverage.
📝 Abstract
Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed. Confounded features that encode both signals are detrimental to output quality, and targeted suppression of them yields a 1.1% accuracy improvement and a 75% entropy reduction, with effects transferring across the ARC-Challenge and RACE benchmarks. The feature categories are also informationally distinct: the activations of just 3 confounded features from a single mid-network layer predict model correctness (AUROC ~0.79), enabling selective abstention that raises accuracy from 62% to 81% at 53% coverage. The results demonstrate that uncertainty and correctness are distinct internal phenomena, with implications for interpretability and targeted inference-time intervention.
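The selective-abstention and AUROC evaluations mentioned above can be illustrated with a minimal sketch. It assumes each prediction already has a scalar correctness score (for example, the output of a small probe over a few feature activations); how that score is computed is not specified here, and the function names `auroc` and `selective_accuracy` as well as the toy numbers are hypothetical, not taken from the paper.

```python
from typing import Sequence, Tuple


def auroc(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct prediction outscores an incorrect one (ties count 0.5)."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def selective_accuracy(
    scores: Sequence[float], correct: Sequence[bool], coverage: float
) -> Tuple[float, float]:
    """Answer only the top `coverage` fraction of predictions ranked by score.

    Returns (score threshold, accuracy on the answered subset).
    """
    if len(scores) != len(correct) or not scores:
        raise ValueError("scores and correct must be non-empty and equally long")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n_keep = max(1, round(coverage * len(scores)))
    # Rank predictions from most to least likely correct, then keep the top slice.
    ranked = sorted(zip(scores, correct), key=lambda pair: pair[0], reverse=True)
    kept = ranked[:n_keep]
    threshold = kept[-1][0]
    accuracy = sum(1 for _, c in kept if c) / n_keep
    return threshold, accuracy


# Toy data (illustrative only): hypothetical probe scores and correctness labels.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_correct = [True, True, False, True, False, True, False, False]

full_threshold, full_acc = selective_accuracy(toy_scores, toy_correct, coverage=1.0)
half_threshold, half_acc = selective_accuracy(toy_scores, toy_correct, coverage=0.5)
toy_auroc = auroc(toy_scores, toy_correct)
```

On this toy data, abstaining on the lower-scored half raises accuracy on the answered subset relative to answering everything, which is the same trade-off between coverage and accuracy that the abstract reports.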