Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
It remains unclear whether the factual recall (i.e., retrieval) and multi-step reasoning capabilities of Transformer models are supported by distinct internal mechanisms. Method: We construct a controllable synthetic language-puzzle dataset and apply causal interventions at the layer, attention-head, and neuron levels (activation patching, structured ablation, and fine-grained attribution) to isolate functional circuits in Qwen and LLaMA. Contribution/Results: We provide the first empirical evidence that recall and reasoning operate via separable computational circuits: selectively ablating the recall circuit reduces factual accuracy by up to 15% without impairing reasoning performance, and vice versa. We identify functionally specialized subnetworks and propose an asymmetric intervention framework. This work establishes a new paradigm for interpretable modeling, targeted model evaluation, and safety-aware model steering.

📝 Abstract
Transformer-based language models excel at both recall (retrieving memorized facts) and reasoning (performing multi-step inference), but whether these abilities rely on distinct internal mechanisms remains unclear. Distinguishing recall from reasoning is crucial for predicting model generalization, designing targeted evaluations, and building safer interventions that affect one ability without disrupting the other.

We approach this question through mechanistic interpretability, using controlled datasets of synthetic linguistic puzzles to probe transformer models at the layer, head, and neuron level. Our pipeline combines activation patching and structured ablations to causally measure component contributions to each task type. Across two model families (Qwen and LLaMA), we find that interventions on distinct layers and attention heads lead to selective impairments: disabling identified "recall circuits" reduces fact-retrieval accuracy by up to 15% while leaving reasoning intact, whereas disabling "reasoning circuits" reduces multi-step inference by a comparable margin. At the neuron level, we observe task-specific firing patterns, though these effects are less robust, consistent with neuronal polysemanticity.

Our results provide the first causal evidence that recall and reasoning rely on separable but interacting circuits in transformer models. These findings advance mechanistic interpretability by linking circuit-level structure to functional specialization and demonstrate how controlled datasets and causal interventions can yield mechanistic insights into model cognition, informing safer deployment of large language models.
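Activation patching, the core causal tool the abstract describes, replaces a component's activation on one input with the activation cached from another input and measures how much of the original behavior is restored. A minimal toy sketch of the idea follows; it uses a two-layer NumPy network standing in for a transformer, not the paper's models or code, and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP standing in for a stack of transformer blocks.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def forward(x, patch_layer1=None):
    """Run the toy model; optionally overwrite the layer-1
    activation with a cached one (activation patching)."""
    h1 = np.tanh(x @ W1)        # "layer 1" activation
    if patch_layer1 is not None:
        h1 = patch_layer1       # the causal intervention
    return h1 @ W2, h1

clean_x = rng.normal(size=4)     # e.g. the correct-fact prompt
corrupt_x = rng.normal(size=4)   # e.g. a corrupted prompt

clean_out, clean_h1 = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)
patched_out, _ = forward(corrupt_x, patch_layer1=clean_h1)

# Patching layer 1 from the clean run fully restores the clean
# output here, because everything downstream depends only on h1;
# in a real model the degree of restoration is the layer's
# causal contribution to the behavior.
print(np.allclose(patched_out, clean_out))  # True
```

In the paper's setting the same logic is applied per layer, per attention head, and per neuron, with restoration measured on recall prompts versus multi-step reasoning prompts to localize each ability.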
Problem

Research questions and friction points this paper is trying to address.

Distinguishing recall from reasoning mechanisms in transformer models
Identifying separable circuits for fact retrieval versus multi-step inference
Using causal interventions to understand model cognition for safer deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise attention analysis distinguishes recall and reasoning circuits
Activation patching and ablations causally measure component contributions
Task-specific firing patterns reveal separable but interacting circuits
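The structured ablations referenced above zero out a chosen component's contribution and check which task degrades. A toy sketch of head-level ablation, again with illustrative NumPy stand-ins rather than the authors' models:

```python
import numpy as np

rng = np.random.default_rng(1)
n_heads, d_model, seq = 2, 4, 3
# Random per-head projections standing in for attention heads
# whose outputs sum into the residual stream.
Wv = rng.normal(size=(n_heads, d_model, d_model))

def multi_head(x, ablate=()):
    """Sum of per-head outputs; heads listed in `ablate` are
    zeroed (structured ablation)."""
    out = np.zeros_like(x)
    for h in range(n_heads):
        y = x @ Wv[h]
        if h in ablate:
            y = np.zeros_like(y)  # knock out this head
        out += y
    return out

x = rng.normal(size=(seq, d_model))
full = multi_head(x)
without_h0 = multi_head(x, ablate={0})

# With head 0 ablated, only head 1's contribution remains.
print(np.allclose(without_h0, x @ Wv[1]))  # True
```

In the paper, selectivity comes from comparing such ablations across task types: knocking out a candidate "recall head" should hurt fact-retrieval accuracy while leaving multi-step inference roughly intact, and vice versa.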