Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) frequently fail on reasoning tasks under input perturbations, revealing overreliance on memorized knowledge, particularly in chain-of-thought (CoT) reasoning, where spurious memories induce erroneous intermediate steps that propagate to final answers. Method: The authors propose STIM, a fine-grained diagnostic framework that attributes token-level memory sources to pretraining-corpus co-occurrence statistics, distinguishing local, mid-range, and long-range memory. STIM combines multi-scale context matching with statistical dependency modeling to pinpoint memorized reasoning steps. Contribution/Results: The authors identify local memory as the primary driver of error propagation: up to 67% of erroneous tokens originate from it. STIM consistently correlates local memory with complex and long-tail samples across diverse tasks and data distributions. Moreover, it accurately predicts faulty reasoning steps, providing an interpretable, memory-aware diagnostic tool for improving CoT robustness.

📝 Abstract
Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs are altered slightly, raising concerns about the extent to which their success relies on memorization. This issue is especially acute in Chain-of-Thought (CoT) reasoning, where spurious memorized patterns can trigger intermediate errors that cascade into incorrect final answers. We introduce STIM, a novel framework for Source-aware Token-level Identification of Memorization, which attributes each token in a reasoning chain to one of multiple memorization sources (local, mid-range, or long-range) based on their statistical co-occurrence with the token in the pretraining corpus. Our token-level analysis across tasks and distributional settings reveals that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, accounting for up to 67% of wrong tokens. We also show that memorization scores from STIM are effective at predicting the wrong tokens within a wrong reasoning step. STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured step-wise generation tasks.
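The abstract describes attributing each generated token to a memory source via its statistical co-occurrence with context words in the pretraining corpus. The paper itself does not spell out the scoring formula here, so the following is only a minimal illustrative sketch under assumptions: a toy token list stands in for corpus-scale n-gram statistics, and the score is simply the maximum normalized co-occurrence between the generated token and any context word within a window (small windows approximating "local" memory, larger ones mid-/long-range). The function names and normalization are hypothetical, not STIM's actual method.

```python
# Illustrative sketch only: toy corpus in place of real pretraining-corpus
# n-gram statistics; scoring formula is an assumption, not STIM's.
corpus = "the cat sat on the mat . the cat ate fish .".split()

def cooccurrence(word, token, window, corpus):
    """Count occurrences of `word` followed by `token` within `window` positions."""
    count = 0
    for i, w in enumerate(corpus):
        if w == word and token in corpus[i + 1 : i + 1 + window]:
            count += 1
    return count

def memorization_score(context, token, window):
    """Max co-occurrence between `token` and any context word,
    normalized by the token's overall corpus frequency.
    A larger `window` approximates mid-/long-range memory sources."""
    total = max(1, corpus.count(token))
    return max(
        (cooccurrence(w, token, window, corpus) / total for w in context),
        default=0.0,
    )

context = ["the", "cat"]
local = memorization_score(context[-1:], "sat", window=2)  # local: last token only
longer = memorization_score(context, "sat", window=8)      # longer-range context
```

In this toy corpus both scores are high because "sat" always follows "cat"; in the paper's setting, comparing such scores across context ranges is what lets STIM label a token's dominant memory source.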
Problem

Research questions and friction points this paper is trying to address.

Identifying memorization sources in Chain-of-Thought reasoning errors
Analyzing token-level memorization impact on model performance
Predicting wrong reasoning steps using memorization scores
Innovation

Methods, ideas, or system contributions that make the work stand out.

STIM framework identifies token-level memorization sources
Analyzes local, mid-range, long-range memorization patterns
Memorization scores predict errors in reasoning steps
Huihan Li
University of Southern California
You Chen
University of California, San Diego
Siyuan Wang
University of Southern California
Yixin He
University of Southern California
Ninareh Mehrabi
Amazon
AI Safety, Responsible AI
Rahul Gupta
Amazon AGI
Xiang Ren
University of Southern California