🤖 AI Summary
This study addresses the unclear separation between language generation and semantic preservation mechanisms in large language models (LLMs) for machine translation, an ambiguity that limits interpretability. By analyzing the role of attention heads at the sentence level, the work decouples translation into two subtasks—target language generation and semantic equivalence—and identifies distinct sparse sets of attention heads responsible for each. It reveals, for the first time at the sentence level, a functional dissociation between linguistic form and semantic content. Building on this insight, the authors propose a subtask-based steering vector method that enables precise control over translation without relying on instruction prompts. Modifying only about 1% of critical attention heads achieves performance comparable to prompt-based approaches, while selective ablation accurately disrupts specific functions. The approach is validated across three open-source LLMs and 20 language directions.
📝 Abstract
Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to word-level analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e. target language identification) and preserving the input sentence's meaning (i.e. sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.
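The steering-vector idea described above can be sketched in a few lines: derive a per-head direction from contrastive activations (e.g., with vs. without a translation instruction), then add that direction to the head's output at inference time. The function names, the mean-difference construction, and the scale `alpha` below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def build_steering_vector(acts_with_instruction, acts_without):
    # Contrastive mean difference over examples for one attention head:
    # the direction the head's activation moves when the subtask is active.
    # Both inputs: (num_examples, head_dim) arrays of head activations.
    return acts_with_instruction.mean(axis=0) - acts_without.mean(axis=0)

def steer_head_output(head_output, steering_vector, alpha=1.0):
    # Add the scaled steering vector to a single head's output at
    # inference time; applied only to the small set of selected heads.
    return head_output + alpha * steering_vector

# Toy demonstration with random activations (head_dim = 8).
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(16, 8))   # activations with MT instruction
acts_b = rng.normal(size=(16, 8))   # activations without instruction
v = build_steering_vector(acts_a, acts_b)
steered = steer_head_output(rng.normal(size=8), v, alpha=2.0)
print(steered.shape)  # → (8,)
```

In practice such an addition would be registered as a forward hook on the chosen attention heads, leaving the remaining ~99% of heads untouched; ablation corresponds to zeroing or mean-replacing those same head outputs instead of steering them.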