🤖 AI Summary
This study investigates how large language models internally represent and process instructions across two post-training stages: supervised fine-tuning (SFT) and direct preference optimization (DPO). Using causal mediation analysis, the authors find that instruction representations are fairly localized and form in the early layers; they call these representations Instruction Vectors (IVs). IVs are linearly separable yet interact with downstream computation non-linearly, which calls into question the scope of the linear representation hypothesis common in mechanistic interpretability. To disentangle this interaction, the work proposes a method for localizing information processing that avoids the implicit linear assumptions of patching-based techniques. The results indicate that IVs act as circuit selectors: conditioned on the task representation formed early in the network, later layers select different information pathways to solve the task.
📝 Abstract
Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known about how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representations are fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability along with non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis that is commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method to localize information processing in language models that is free from the implicit linear assumptions of patching-based techniques. We find that, conditioned on the task representations formed in the early layers, different information pathways are selected in the later layers to solve that task, i.e., IVs act as circuit selectors.
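To make the concern about "implicit linear assumptions of patching-based techniques" concrete, here is a minimal sketch of single-site activation patching on a toy network. It is not the method proposed in the paper: the network, its weights, the multiplicative readout, and every function name below are hypothetical and chosen only for illustration. The toy shows that when two hidden units interact multiplicatively, restoring either one alone recovers no effect, so per-site patching scores do not add up to the total effect.

```python
import math

def hidden(x, W1):
    """First layer of the toy model: h = tanh(W1 @ x)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def readout(h):
    """Hypothetical non-linear readout: units 0 and 1 interact multiplicatively."""
    return h[0] * h[1] + h[2]

def patched_output(h_corrupt, h_clean, sites):
    """Readout on corrupted activations with the given sites restored from the clean run."""
    h = list(h_corrupt)
    for i in sites:
        h[i] = h_clean[i]
    return readout(h)

# Hypothetical weights and inputs (assumptions, not taken from the paper).
W1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x_clean = [1.0, 1.0]
x_corrupt = [0.0, 0.0]

h_clean = hidden(x_clean, W1)
h_corrupt = hidden(x_corrupt, W1)

base = readout(h_corrupt)
total_effect = readout(h_clean) - base

# Effect recovered by restoring one hidden unit at a time.
single = {i: patched_output(h_corrupt, h_clean, [i]) - base for i in range(3)}

# Effect recovered by restoring units 0 and 1 together.
joint_01 = patched_output(h_corrupt, h_clean, [0, 1]) - base

print("single-site effects:", single)
print("joint effect of units 0 and 1:", joint_01)
print("sum of single-site effects:", sum(single.values()))
print("total effect:", total_effect)
```

In this toy, units 0 and 1 each receive a single-site score of zero even though restoring them jointly recovers a substantial share of the total effect. Under these assumptions, that is the kind of non-linear interaction that a per-site, additive reading of patching results would miss.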