🤖 AI Summary
This work addresses the challenge of translating the interpretable features uncovered by interpretability tools, such as sparse autoencoders, into training strategies that improve performance. To this end, it introduces Interpretability-Guided Data Selection (IGDS), a framework that establishes a direct pathway from internal interpretable features to data selection. IGDS identifies task-relevant internal features through frequency recall and interventional filtering, then selects "feature-resonant" data that strongly activates these features for fine-tuning. Experiments on Gemma-2, LLaMA-3.1, and Qwen3 show that IGDS achieves superior performance using only 50% of the training data: on mathematical reasoning, Gemma-2-2B fine-tuned with IGDS outperforms full-data fine-tuning by 17.4% and also surpasses existing baselines based on data quality and diversity.
📝 Abstract
While mechanistic interpretability tools like Sparse Autoencoders (SAEs) can uncover meaningful features within Large Language Models (LLMs), a critical gap remains in transforming these insights into practical actions for model optimization. We bridge this gap with the hypothesis that data selection guided by a model's internal task features is an effective training strategy. Building on this hypothesis, we propose Interpretability-Guided Data Selection (IGDS), a framework that first identifies causal task features through frequency recall and interventional filtering, then selects "Feature-Resonant Data" that maximally activates these task features for fine-tuning. We validate IGDS on mathematical reasoning, summarization, and translation tasks with Gemma-2, LLaMA-3.1, and Qwen3 models. Our experiments demonstrate strong data efficiency: on the Math task, IGDS surpasses full-dataset fine-tuning by 17.4% on Gemma-2-2B while using only 50% of the data, and outperforms established baselines focused on data quality and diversity. Analysis confirms a strong positive correlation between feature amplification and task performance improvement. IGDS thus provides a direct and effective framework for enhancing LLMs by leveraging their internal mechanisms, validating our core hypothesis.
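To make the selection step concrete, the following is a minimal sketch of choosing "feature-resonant" examples once the task features are known. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how activations are aggregated, so the mean-activation score, the function name `select_feature_resonant`, its parameters, and the toy activation values are all hypothetical. It also assumes per-example SAE feature activations have already been computed and pooled into one value per feature.

```python
def select_feature_resonant(activations, task_features, keep_fraction=0.5):
    """Return (selected_indices, scores) for the most feature-resonant examples.

    activations:   list of rows, one per training example, each holding one
                   pooled activation value per SAE feature (assumed precomputed).
    task_features: indices of the features already identified as task-relevant.
    keep_fraction: share of the dataset to keep (0.5 mirrors the 50% budget).
    """
    # Assumed scoring rule: mean activation over the task features.
    scores = [
        sum(row[j] for j in task_features) / len(task_features)
        for row in activations
    ]
    n_keep = max(1, int(keep_fraction * len(scores)))
    # Highest score first; Python's sort is stable, so ties keep dataset order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n_keep], scores


# Toy data: 4 examples x 3 features (values are invented for illustration).
toy_activations = [
    [0.9, 0.1, 0.8],
    [0.0, 0.7, 0.1],
    [0.5, 0.2, 0.6],
    [0.1, 0.9, 0.0],
]
selected, scores = select_feature_resonant(toy_activations, task_features=[0, 2])
print(selected)  # indices of the examples kept for fine-tuning
```

With features 0 and 2 treated as the task features, examples 0 and 2 score highest and are kept. A sum or max over task features would be an equally plausible scoring choice; the mean is used here only because it keeps scores comparable when the number of task features changes.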