🤖 AI Summary
This study addresses two problems: the risk of catastrophic forgetting during supervised fine-tuning (SFT) and the poorly understood internal mechanisms behind instruction-following in language models. Combining information-theoretic analysis, geometric metrics, and optimization-trajectory inspection across models from 1B to 32B parameters, the work reports that instruction alignment exhibits architectural locality: representations in intermediate layers (20%–80% depth) remain stable, while those in the final layers are highly sensitive. Building on this insight, the authors propose Mid-Block Efficient Tuning, which fine-tunes only these intermediate layers. The approach improves on standard LoRA by up to 10.2% on GSM8K (using OLMo2-7B) while reducing the number of trainable parameters.
📝 Abstract
While critical for alignment, Supervised Fine-Tuning (SFT) incurs the risk of catastrophic forgetting, yet the layer-wise emergence of instruction-following capabilities remains elusive. We investigate this mechanism via a comprehensive analysis utilizing information-theoretic, geometric, and optimization metrics across model scales (1B–32B). Our experiments reveal a distinct depth-dependent pattern: middle layers (20%–80% of depth) are stable, whereas final layers exhibit high sensitivity. Leveraging this insight, we propose Mid-Block Efficient Tuning, which selectively updates these critical intermediate layers. Empirically, our method outperforms standard LoRA by up to 10.2% on GSM8K (OLMo2-7B) with reduced parameter overhead, demonstrating that effective alignment is architecturally localized rather than distributed. The code is publicly available at https://anonymous.4open.science/r/base_sft.
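To make the 20%–80% depth band concrete, the sketch below maps it to layer indices for a model with a given number of transformer blocks. It is only an illustration, not the authors' implementation: the function name `mid_block_layers`, the integer-percentage parameters, and the rounding convention (floor at the lower bound, ceiling at the upper bound) are all assumptions, and the abstract does not say whether every layer in the band is updated.

```python
def mid_block_layers(num_layers: int, lo_pct: int = 20, hi_pct: int = 80) -> list[int]:
    """Return 0-based indices of layers inside the [lo_pct, hi_pct] depth band.

    Hypothetical helper: the rounding convention is an assumption, not
    something specified in the abstract.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0 <= lo_pct < hi_pct <= 100:
        raise ValueError("require 0 <= lo_pct < hi_pct <= 100")
    # Integer arithmetic avoids floating-point surprises at band edges.
    start = (num_layers * lo_pct) // 100        # floor of the lower bound
    stop = -((-num_layers * hi_pct) // 100)     # ceiling of the upper bound
    return list(range(start, stop))


# Example: a hypothetical 32-layer model.
selected = mid_block_layers(32)
print(f"{len(selected)} of 32 layers selected: {selected[0]}..{selected[-1]}")
```

Under these assumptions, a 32-layer model yields layers 6 through 25, so adapters would be attached to 20 of the 32 blocks. A list like this could then be used to restrict which layers receive trainable adapters in whatever fine-tuning framework is in use.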