Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

📅 2026-05-07

🤖 AI Summary

This work challenges the reliability of static mechanistic localization as a guide for parameter updates during post-training of large language models. By tracking the structural evolution of Transformer circuits throughout supervised fine-tuning, the study introduces three novel metrics (circuit distance, circuit stability, and circuit conflict) and uses them to reveal a "free evolution" phenomenon: circuits shift during parameter updates, so a mechanism localized at the current state lags behind the states it is supposed to guide. The analysis also deconstructs the "illusion of effectiveness" in current methods, showing empirically that static mechanisms fail to anticipate future model states. The paper therefore advocates a forward-looking, predictive approach to mechanistic localization in support of interpretability-guided, efficient post-training.

📝 Abstract

The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs): it identifies critical components via mechanistic interpretability and then applies targeted parameter updates to them. However, this paradigm rests on a fundamental yet unverified assumption, namely that mechanisms derived from current static parameters can reliably guide future dynamic parameter updates. To test this assumption, we systematically track the structural evolution of Transformer circuits throughout the supervised fine-tuning (SFT) process, revealing the underlying dynamics of task mechanisms. We introduce three novel metrics (Circuit Distance, Circuit Stability, and Circuit Conflict) to analyze circuit evolution across three dimensions: neural migration, semantic stability, and cross-task interference. Our empirical results reveal that circuits inherently exhibit "Free Evolution" during parameter updates. Consequently, static mechanisms extracted from current states inevitably suffer from temporal latency, making them fundamentally inadequate for guiding future states. Moreover, by deconstructing the "illusion of effectiveness" in existing methods, this work underscores the necessity of "foresight" in mechanistic localization and proposes a predictive framework for future research.
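
To make the idea of temporal latency concrete, here is a minimal sketch of how drift between a circuit localized at the start of SFT and the circuits at later checkpoints might be measured. The abstract does not define Circuit Distance, so the Jaccard distance over component sets used here is only a stand-in, not the paper's metric. The `(layer, head)` representation, the function names, and the toy checkpoint data are all hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical representation: a circuit is a set of (layer, head) components.
Component = Tuple[int, int]
Circuit = FrozenSet[Component]


def jaccard_distance(a: Circuit, b: Circuit) -> float:
    """Illustrative stand-in for a circuit distance (NOT the paper's definition).

    Returns 0.0 for identical circuits and 1.0 for disjoint ones.
    """
    union = a | b
    if not union:
        return 0.0  # two empty circuits are treated as identical
    return 1.0 - len(a & b) / len(union)


def drift_from_initial(circuits: List[Circuit]) -> List[float]:
    """Distance of each checkpoint's circuit from the one localized at step 0.

    A static "locate-then-update" method implicitly assumes these values
    stay close to zero over the course of training.
    """
    first = circuits[0]
    return [jaccard_distance(first, c) for c in circuits]


# Toy data (assumed, not from the paper): circuits at three SFT checkpoints.
checkpoints: List[Circuit] = [
    frozenset({(0, 1), (2, 3), (5, 0)}),
    frozenset({(0, 1), (2, 3), (6, 2)}),
    frozenset({(0, 1), (7, 4), (6, 2)}),
]

drift = drift_from_initial(checkpoints)
for step, d in enumerate(drift):
    print(f"checkpoint {step}: distance from initial circuit = {d:.2f}")
```

In this toy run the drift grows with each checkpoint, which is the pattern the abstract describes as a static map becoming stale. A set-overlap measure ignores edge structure and component importance, so it should be read as a rough proxy only.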

Problem

Research questions and friction points this paper is trying to address.

mechanistic interpretability

static localization

parameter updates

circuit evolution

temporal latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

mechanistic interpretability

circuit evolution

static localization