🤖 AI Summary
Large language models (LLMs) frequently exhibit a “knowledge–prediction gap” on multiple-choice questions (MCQs): although correct knowledge is implicitly encoded in their representations, they output incorrect predictions. This work is the first to reveal, through a geometric lens, that this gap arises from misalignment between the subspaces encoding factual knowledge and those governing prediction decisions. To address this, we propose KAPPA, a parameter-free intervention that identifies knowledge- and prediction-relevant subspaces via residual stream probing and aligns their hidden-state coordinates via orthogonal projection. KAPPA requires no fine-tuning and generalizes across tasks. Evaluated on binary-choice benchmarks, including Big-Bench-Hard and ARC-Challenge, KAPPA significantly improves accuracy over state-of-the-art baselines. We further demonstrate the cross-dataset generalizability of the identified subspace structure and show KAPPA’s extensibility to open-ended generation tasks.
📝 Abstract
Large Language Models (LLMs) often fail on multiple-choice questions (MCQs) despite demonstrating correct knowledge in other contexts, such as free-form generation. To investigate the mechanism underlying this knowledge-prediction gap on MCQs and alleviate it, we conduct a probing analysis and find that residual streams in certain layers contain a subspace spanned by two important bases: a *knowledge basis* that encodes the probability of the ground-truth answer for a given MCQ and a *prediction basis* that encodes the probability of the answer choice predicted by the model. We observe that incorrect predictions arise from a misalignment of the model's hidden states along these two bases. Hence, we introduce **KAPPA** (Knowledge-Aligned Prediction through Projection-based Adjustment), a parameter-free intervention that transforms the hidden states to align the prediction coordinate with the knowledge coordinate within this subspace. Experiments on binary-choice reformulations of Big-Bench-Hard and ARC-Challenge show that KAPPA substantially improves accuracy and consistently outperforms baselines. While optimal subspaces differ across tasks, the subspaces generalize to some extent, as supported by cross-dataset experiments. Moreover, KAPPA remains effective on free-form questions beyond MCQs. Our work provides a new geometric understanding of the knowledge-prediction gap and offers a practical method for better aligning model behavior with its latent knowledge.
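The core intervention described above, replacing a hidden state's coordinate along the prediction basis with its coordinate along the knowledge basis, can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the function name `kappa_align`, the toy dimensionality, and the assumption that the two probed basis vectors are orthonormal are all ours.

```python
import numpy as np

def kappa_align(h, k_basis, p_basis):
    """Illustrative KAPPA-style alignment (hypothetical helper, not the paper's code).

    h        : hidden-state vector from a chosen layer's residual stream.
    k_basis  : unit vector along the probed knowledge direction.
    p_basis  : unit vector along the probed prediction direction.
    Assumes k_basis and p_basis are orthonormal for simplicity.
    """
    c_k = h @ k_basis  # coordinate along the knowledge basis
    c_p = h @ p_basis  # coordinate along the prediction basis
    # Shift the prediction coordinate to match the knowledge coordinate,
    # leaving the component outside the 2-D subspace untouched.
    return h + (c_k - c_p) * p_basis

# Toy example: build two orthonormal directions in an 8-D space via QR.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(8, 2)))
k_basis, p_basis = q[:, 0], q[:, 1]

h = rng.normal(size=8)
h_new = kappa_align(h, k_basis, p_basis)
```

After the transform, `h_new @ p_basis` equals `h_new @ k_basis`, i.e. the model's "prediction" coordinate now agrees with its "knowledge" coordinate, while the knowledge coordinate itself is unchanged.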