🤖 AI Summary
This study addresses the challenge of disentangling reasoning errors from failures in answer-symbol binding in multiple-choice question answering. By combining representational analyses—PCA and linear probing—with causal interventions such as symbol- and content-swapping tests, the work provides clear evidence for a two-stage mechanism in language models: they first encode the correct answer content in residual states near option boundaries (at positions preceding output-token generation) and subsequently bind that content to the corresponding answer symbol. This finding clarifies the internal computation underlying multiple-choice QA and opens new avenues for diagnosing and improving model performance.
📝 Abstract
Multiple-choice question answering (MCQA) is easy to evaluate but adds a meta-task: models must both solve the problem and output the symbol that *represents* the answer, conflating reasoning errors with symbol-binding failures. We study how language models implement MCQA internally using representational analyses (PCA, linear probes) as well as causal interventions. We find that option-boundary (newline) residual states often contain strong linearly decodable signals related to per-option correctness. Winner-identity probing reveals a two-stage progression: the winning *content position* becomes decodable immediately after the final option is processed, while the *output symbol* is represented closer to the answer emission position. Tests under symbol and content permutations support a two-stage mechanism in which models first select a winner in content space and then bind or route that winner to the appropriate symbol to emit.
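The linear-probing analysis described above can be illustrated with a minimal sketch. The code below does not use the paper's actual activations or probe implementation; it substitutes synthetic data for residual-stream states collected at option-boundary positions, and fits a simple closed-form ridge-regression probe to a binary per-option correctness label. All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for residual-stream states at option-boundary
# (newline) positions: one d-dimensional vector per option, with a binary
# "is this option correct?" label. Real states would come from model
# activations; here the label is linearly decodable by construction.
d, n = 64, 2000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(int)

# Train/held-out split for measuring probe accuracy
X_tr, X_te, y_tr, y_te = X[:1500], X[1500:], y[:1500], y[1500:]

def fit_linear_probe(X, y, l2=1e-2):
    """Closed-form ridge-regression probe on +/-1 targets."""
    t = 2.0 * y - 1.0
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ t)

w = fit_linear_probe(X_tr, y_tr)
acc = ((X_te @ w > 0).astype(int) == y_te).mean()
print(f"held-out probe accuracy: {acc:.3f}")
```

High held-out accuracy of such a probe is what "strong linearly decodable signals" means operationally: a single linear direction in the residual stream separates correct from incorrect options.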