Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs

📅 2026-01-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
While large language models (LLMs) demonstrate strong performance in translation tasks, the internal mechanisms that trigger such capabilities remain poorly understood. This work employs sparse autoencoders to identify and validate critical “activation features” within LLMs that initiate translation behavior. The functional role of these features is confirmed through causal interventions—specifically, amplification and ablation—and further supported by principal component analysis (PCA) consistency checks. Leveraging this mechanistic insight, the study proposes a data selection strategy targeting challenging translation samples, which substantially improves fine-tuning efficiency and mitigates hallucination. Moreover, the identified activation mechanism exhibits transferability, as evidenced by its successful validation on larger-scale models within the same family.
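The PCA consistency check mentioned above can be illustrated with a minimal sketch: score a set of candidate feature vectors by how much of their variance the top principal component explains, so that functionally coherent features (those varying along a shared direction) score high. The function name and the explained-variance metric are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pca_consistency(feature_vectors, k=1):
    """Fraction of variance captured by the top-k principal components.

    feature_vectors: (n_samples, dim) array of candidate feature activations.
    A score near 1.0 suggests the candidates act coherently along a shared
    direction (illustrative metric, not the paper's exact one).
    """
    X = feature_vectors - feature_vectors.mean(axis=0)
    # Singular values of the centered matrix give per-component variances.
    s = np.linalg.svd(X, compute_uv=False)
    var = s ** 2
    return var[:k].sum() / var.sum()
```

Features whose activations collapse onto one direction would pass such a filter; diffuse, incoherent candidates would not.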

📝 Abstract
Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task-specific fine-tuning. However, the internal mechanisms governing this innate capability remain largely opaque. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task-specific features. Our method first recalls features that are frequently co-activated on translation inputs and then filters them for functional coherence using a PCA-based consistency metric. This framework successfully isolates a small set of **translation initiation** features. Causal interventions demonstrate that amplifying these features steers the model towards correct translation, while ablating them induces hallucinations and off-task outputs, confirming they represent a core component of the model's innate translation competency. Moving from analysis to application, we leverage this mechanistic insight to propose a new data selection strategy for efficient fine-tuning. Specifically, we prioritize training on **mechanistically hard** samples: those that fail to naturally activate the translation initiation features. Experiments show this approach significantly improves data efficiency and suppresses hallucinations. Furthermore, we find these mechanisms are transferable to larger models of the same family. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanisms to create more robust and efficient models. The code is available at https://github.com/flamewei123/AAAI26-translation-Initiation-Features.
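The causal interventions the abstract describes, amplification and ablation, reduce in the simplest reading to scaling or zeroing the selected SAE feature activations before they are decoded back into the residual stream. A minimal sketch, where the function name, array layout, and scale factor are all illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def intervene_on_features(sae_acts, feature_ids, mode="amplify", scale=5.0):
    """Amplify or ablate selected SAE feature activations.

    sae_acts: (seq_len, n_features) array of SAE activations for one input.
    feature_ids: indices of the candidate translation-initiation features.
    Returns a modified copy; the original activations are left untouched.
    """
    acts = sae_acts.copy()
    if mode == "amplify":
        acts[:, feature_ids] *= scale   # steer the model toward translation
    elif mode == "ablate":
        acts[:, feature_ids] = 0.0      # suppress the features entirely
    else:
        raise ValueError(f"unknown mode: {mode}")
    return acts
```

In the paper's setup, amplification steers the model toward correct translation while ablation induces hallucinations and off-task outputs, which is what establishes the features' causal role.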
Problem

Research questions and friction points this paper is trying to address.

translation initiation
Large Language Models
mechanistic interpretability
task-specific features
innate translation competency
Innovation

Methods, ideas, or system contributions that make the work stand out.

translation initiation features
Sparse Autoencoders
mechanistic interpretability
data-efficient fine-tuning
causal intervention
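The data-efficient fine-tuning contribution, prioritizing "mechanistically hard" samples whose translation-initiation features fail to activate naturally, can be sketched as a simple threshold on per-sample feature activation. The function name, matrix layout, and threshold value below are hypothetical, not the paper's exact selection rule:

```python
import numpy as np

def select_hard_samples(activation_matrix, feature_ids, threshold=0.1):
    """Pick samples whose translation-initiation features fail to fire.

    activation_matrix: (n_samples, n_features) mean SAE activation per sample.
    feature_ids: indices of the identified translation-initiation features.
    Returns indices of "mechanistically hard" samples to prioritize in
    fine-tuning.
    """
    strength = activation_matrix[:, feature_ids].mean(axis=1)
    return np.where(strength < threshold)[0]
```

Fine-tuning on only these under-activating samples is what the abstract reports as improving data efficiency and suppressing hallucinations.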