🤖 AI Summary
This work addresses the problem that standard uncertainty estimates for language models, most commonly the maximum softmax probability (MSP), are often miscalibrated and so fail to reflect the model's true confidence. The authors treat the per-layer MLP updates as a cumulative geometric trajectory across depth and extract eleven scale-invariant path features from it. A sparse linear probe on these features then quantifies generation uncertainty. Because the features describe how a representation forms across layers rather than only where it ends, the method exposes uncertainty that the final output probabilities obscure, and the probe's coefficients indicate where along depth errors take shape. Under selective abstention, the probe outperforms MSP, with gains that scale with the baseline's miscalibration, reaching up to 21 points in area under the risk-coverage curve (AURC).
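For readers unfamiliar with the two quantities named in this summary, the following is a minimal sketch of how MSP and AURC are commonly computed. The function names are hypothetical, and the AURC definition used here (mean selective risk over all coverage levels, with lower being better) is the common one; the paper may use a variant or a different scaling.

```python
import math

def max_softmax_probability(logits):
    """Confidence score: the largest softmax probability for one prediction."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)

def aurc(confidences, correct):
    """Area under the risk-coverage curve (assumed definition, see lead-in).

    Examples are ranked from most to least confident. At coverage k/n the
    risk is the error rate among the k most confident examples; AURC is the
    mean of these risks over k = 1..n.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors = 0
    risks = []
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / k)
    return sum(risks) / len(risks)

# Toy data: a confidence score that ranks correct answers first gives a lower AURC.
good_ranking = aurc([0.9, 0.8, 0.7, 0.6], [True, True, False, False])
bad_ranking = aurc([0.9, 0.8, 0.7, 0.6], [False, False, True, True])
```

Any scalar confidence score can be passed to `aurc`, which is what makes it possible to compare MSP against a probe's output on the same footing.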
📝 Abstract
The maximum softmax probability (MSP) is a default baseline for uncertainty quantification in language model generation with structured output. Although cheap, it is often miscalibrated. Methods that probe the model's internal activations feed raw hidden states into opaque classifiers, reading activations as static snapshots and leaving implicit the layer-wise trajectory by which a representation is formed. Yet similar endpoints can arise from very different paths, and how evidence accumulates, reinforces, or reverses across depth might reveal uncertainty that final probabilities obscure. We extract eleven scale-invariant geometric features that trace the cumulative path of per-layer MLP updates and feed them to a sparse linear probe. The probe outperforms MSP under selective abstention, with gains that scale with baseline miscalibration, up to 21 AURC points. Because every feature has a closed-form geometric meaning, the probe's coefficients trace how and where along depth errors take shape: which layers commit prematurely, which contradict the running state, and where trajectories drift away from their endpoint.
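To make the idea of a cumulative path of per-layer MLP updates concrete, here is a minimal sketch of a few scale-invariant path descriptors. The abstract does not list the eleven features, so the four below (`straightness`, `mean_step_alignment`, `min_step_alignment`, `min_endpoint_cosine`) are illustrative assumptions, not the authors' feature set, and `path_features` is a hypothetical name.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(a):
    return math.sqrt(_dot(a, a))

def _cos(a, b):
    denom = _norm(a) * _norm(b)
    return _dot(a, b) / denom if denom > 0 else 0.0

def path_features(updates):
    """Illustrative (assumed) descriptors of a layer-wise update path.

    `updates` is a list of L vectors, one MLP update per layer. The running
    state after layer l is the sum of the first l updates. Every feature is
    a ratio or a cosine, so rescaling all updates by c > 0 leaves it unchanged.
    """
    states, running = [], [0.0] * len(updates[0])
    for u in updates:
        running = [r + x for r, x in zip(running, u)]
        states.append(running)
    end = states[-1]

    # Net displacement over total path length: 1 for a straight path.
    total = sum(_norm(u) for u in updates)
    straightness = _norm(end) / total if total > 0 else 0.0

    # Cosine of each update with the running state before it:
    # positive reinforces the state, negative contradicts it.
    align = [_cos(updates[l], states[l - 1]) for l in range(1, len(updates))]

    # Cosine of each intermediate state with the endpoint:
    # low values mean the path was far from where it ended up.
    end_cos = [_cos(s, end) for s in states]

    return {
        "straightness": straightness,
        "mean_step_alignment": sum(align) / len(align) if align else 0.0,
        "min_step_alignment": min(align) if align else 0.0,
        "min_endpoint_cosine": min(end_cos),
    }

# Toy paths: one that moves steadily in one direction, one that reverses.
straight = path_features([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
reversal = path_features([[1.0, 0.0], [-2.0, 0.0]])
```

Features of this kind could then be fed to a sparse linear model, for example an L1-regularized logistic regression predicting whether the generation is wrong. That choice of probe is likewise an assumption about one reasonable instantiation, since the abstract says only "sparse linear probe".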