🤖 AI Summary
This work examines the sensitivity of large language models to meaning-preserving surface perturbations in mathematical reasoning, such as name substitution and number format paraphrasing, which frequently cause answer flips. The authors propose the Mechanistic Perturbation Diagnostics (MPD) framework and use it to evaluate three open-weight models (Mistral-7B, Llama-3-8B, and Qwen2.5-7B) on 677 GSM8K problems paired with semantically equivalent variants. They introduce the Cascading Amplification Index (CAI), a metric of layer-wise divergence amplification used as a failure predictor, and establish a mechanistic failure taxonomy: localized, distributed, and entangled. MPD combines logit lens analysis, activation patching, component ablation, and CAI into a unified diagnostic pipeline, and the taxonomy is validated with targeted repairs (steering vectors and layer fine-tuning). Experiments show answer-flip rates of 28.8%–45.1%; for Llama-3, 43 of 60 samples (about 72%) are recoverable by patching at specific layers, and targeted repairs recover 12.2% of localized failures (Llama-3), 7.2% of entangled failures (Qwen), and 5.2% of distributed failures (Mistral).
📝 Abstract
Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%–45.1%), with number paraphrasing consistently more disruptive than name swaps. To trace the mechanistic basis of these failures, we introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) into a unified diagnostic pipeline. CAI, a novel metric quantifying layer-wise divergence amplification, outperforms the first divergence layer as a failure predictor for two of three architectures (AUC up to 0.679). Logit lens analysis reveals that flipped samples diverge from correct predictions at significantly earlier layers than stable samples. Activation patching reveals a stark architectural divide in failure localizability: Llama-3 failures are recoverable by patching at specific layers (43/60 samples), while Mistral and Qwen failures are broadly distributed, with few or no samples recoverable this way (3/60 and 0/60, respectively). Based on these diagnostic signals, we propose a mechanistic failure taxonomy (localized, distributed, and entangled) and validate it through targeted repair experiments: steering vectors and layer fine-tuning recover 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures.
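To make the quantities named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. `flip_rate` computes the share of problems whose final answer changes under perturbation. `first_divergence_layer` assumes the logit lens yields one top-token prediction per layer for each prompt. `amplification_index` is a hypothetical stand-in for CAI: the abstract does not give the formula, so the mean layer-to-layer growth ratio used here is only one plausible reading of "layer-wise divergence amplification". All function names and inputs are illustrative assumptions.

```python
from typing import Optional, Sequence


def flip_rate(original: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of problem pairs whose final answer differs after perturbation."""
    if len(original) != len(perturbed) or not original:
        raise ValueError("need two non-empty answer lists of equal length")
    flips = sum(a != b for a, b in zip(original, perturbed))
    return flips / len(original)


def first_divergence_layer(
    orig_tokens: Sequence[str], pert_tokens: Sequence[str]
) -> Optional[int]:
    """Index of the first layer whose top-token predictions differ, else None.

    Assumes one per-layer prediction for each prompt (e.g. from a logit lens).
    """
    for layer, (a, b) in enumerate(zip(orig_tokens, pert_tokens)):
        if a != b:
            return layer
    return None


def amplification_index(divergence: Sequence[float], eps: float = 1e-8) -> float:
    """Hypothetical CAI-like score (an assumption, not the paper's definition).

    Takes a per-layer divergence between the original and perturbed runs and
    returns the mean ratio of each layer's divergence to the previous layer's.
    Values above 1 mean the divergence grows with depth on average.
    """
    if len(divergence) < 2:
        raise ValueError("need divergence values for at least two layers")
    ratios = [nxt / (cur + eps) for cur, nxt in zip(divergence, divergence[1:])]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Toy data only; these are not results from the paper.
    print(flip_rate(["18", "3", "70", "5"], ["18", "4", "70", "6"]))
    print(first_divergence_layer(["a", "b", "c"], ["a", "x", "c"]))
    print(amplification_index([0.1, 0.2, 0.4]))
```

In this sketch the per-layer divergence could be any non-negative distance between the two runs, such as a KL divergence between logit lens distributions; which distance the paper uses is not stated in the abstract.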