The Geometry of Self-Verification in a Task-Specific Reasoning Model

📅 2025-04-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
How do reasoning models verify the correctness of their own answers? Method: The authors reproduce DeepSeek R1's training recipe on the CountDown task, exploiting the mode collapse induced by preference tuning to obtain a model that emits highly structured, easily parseable chain-of-thought sequences. They then reverse-engineer the model's verification behavior through a combined top-down analysis of GLU weights and bottom-up analysis of attention heads, linked via inter-layer communication channels. Contribution/Results: The top-down analysis uncovers GLU weight vectors that encode verification-related tokens (e.g., "success", "incorrect") and activate according to the correctness of the model's reasoning steps; the bottom-up analysis identifies "previous-token heads" as the components mainly responsible for verification. Bringing the two together, the identified GLU vectors localize as few as three attention heads whose ablation disables model verification, pointing to a necessary component of a potentially larger verification circuit and providing a circuit-level, interpretability-driven account of self-verification.

📝 Abstract
How do reasoning models verify their own answers? We study this question by training a model using DeepSeek R1's recipe on the CountDown task. We leverage the fact that preference tuning leads to mode collapse, resulting in a model that always produces highly structured and easily parseable chain-of-thought sequences. With this setup, we do a top-down and bottom-up analysis to reverse-engineer how the model verifies its outputs. Our top-down analysis reveals Gated Linear Unit (GLU) weights encoding verification-related tokens, such as "success" or "incorrect", which activate according to the correctness of the model's reasoning steps. Our bottom-up analysis reveals that "previous-token heads" are mainly responsible for model verification. Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU vectors to localize as few as three attention heads that can disable model verification, pointing to a necessary component of a potentially larger verification circuit.
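The top-down GLU analysis can be illustrated with a toy, logit-lens-style projection: each value vector of a GLU down-projection is scored against the unembedding matrix to see which vocabulary token it most strongly promotes. A minimal numpy sketch with random stand-in weights (all names, shapes, and the planted token are hypothetical, not the paper's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_glu, vocab = 64, 256, 1000

# Toy stand-ins for real model weights (assumption: a GLU MLP with a
# down-projection W_down and an unembedding matrix W_U).
W_down = rng.normal(size=(d_glu, d_model))   # rows = GLU value vectors
W_U = rng.normal(size=(d_model, vocab))      # maps residual stream -> logits

# Plant a "verification" vector: make GLU row 7 align with token id 42
# (standing in for a token like "success")
W_down[7] = W_U[:, 42] * 3.0

# Score each GLU value vector against the full vocabulary
scores = W_down @ W_U                        # shape (d_glu, vocab)
top_token = scores.argmax(axis=1)

# GLU vector 7 decodes to the planted token
print(top_token[7])   # -> 42
```

In a real model, the rows whose top decoded tokens are words like "success" or "incorrect" are the candidates for verification-encoding weights.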
Problem

Research questions and friction points this paper is trying to address.

How reasoning models verify their own answers
How GLU weights encode verification-related tokens
Which attention heads are necessary for model verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preference tuning induces mode collapse into structured chain-of-thought sequences
GLU weights encode verification-related token patterns
Ablating as few as three attention heads disables model verification
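The head-ablation step rests on the fact that multi-head attention combines heads additively through the output projection, so zeroing a head's output removes exactly that head's contribution to the residual stream. A toy numpy illustration (random weights; all shapes and head indices are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n_heads, d_head, seq = 4, 8, 5
d_model = n_heads * d_head

W_O = rng.normal(size=(d_model, d_model))        # attention output projection
heads = rng.normal(size=(n_heads, seq, d_head))  # per-head outputs

def attn_output(head_outputs, ablate=()):
    """Concatenate heads and apply W_O, zeroing any ablated heads."""
    h = head_outputs.copy()
    for i in ablate:
        h[i] = 0.0
    concat = h.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_O

# Ablate heads 1 and 3 (stand-ins for the identified verification heads)
ablated = attn_output(heads, ablate=(1, 3))

# Heads contribute additively: the ablated output equals the sum of the
# remaining heads' individual contributions through their slices of W_O
manual = sum(heads[i] @ W_O[i * d_head:(i + 1) * d_head] for i in (0, 2))
print(np.allclose(ablated, manual))  # -> True
```

This additivity is what makes targeted ablations clean: knocking out the few heads that carry verification signal removes that signal without otherwise rewriting the layer's computation.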