From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of turning the interpretable features uncovered by internal interpretability tools, such as sparse autoencoders, into training strategies that improve performance. To this end, it introduces Interpretability-Guided Data Selection (IGDS), a framework that establishes a direct pathway from a model's internal interpretable features to data selection. IGDS identifies task-relevant internal features through frequency recall and interventional filtering, then selects “feature-resonant” data that strongly activates these features for fine-tuning. Experiments on Gemma-2, LLaMA-3.1, and Qwen3 cover mathematical reasoning, summarization, and translation. On mathematical reasoning, Gemma-2-2B fine-tuned with IGDS on only 50% of the training data outperforms full-data fine-tuning by 17.4% and surpasses existing baselines based on data quality and diversity.

📝 Abstract

While mechanistic interpretability tools like Sparse Autoencoders (SAEs) can uncover meaningful features within Large Language Models (LLMs), a critical gap remains in transforming these insights into practical actions for model optimization. We bridge this gap with the hypothesis that data selection guided by a model's internal task features is an effective training strategy. Building on this hypothesis, we propose Interpretability-Guided Data Selection (IGDS), a framework that first identifies causal task features through frequency recall and interventional filtering, then selects “Feature-Resonant Data” that maximally activates those task features for fine-tuning. We validate IGDS on mathematical reasoning, summarization, and translation tasks with Gemma-2, LLaMA-3.1, and Qwen3 models. Our experiments demonstrate exceptional data efficiency: on the Math task, IGDS surpasses full-dataset fine-tuning by 17.4% on Gemma-2-2B while using only 50% of the data, and outperforms established baselines focused on data quality and diversity. Analysis confirms a strong positive correlation between feature amplification and task performance improvement. IGDS thus provides a direct and effective framework for enhancing LLMs by leveraging their internal mechanisms, validating our core hypothesis.
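
The three stages named in the abstract (frequency recall, interventional filtering, and selection of feature-resonant data) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact scoring rules, so the function names, the recall threshold, the mean-activation score, and the precomputed `effect` table (standing in for a real ablation experiment on the model) are all assumptions made for illustration. SAE activations are represented as plain lists of floats, one row per example.

```python
from typing import Dict, List, Sequence


def frequent_features(activations: Sequence[Sequence[float]],
                      min_recall: float = 0.5,
                      threshold: float = 0.0) -> List[int]:
    """Frequency recall (assumed form): keep features that fire on at
    least `min_recall` of the task examples."""
    if not activations:
        return []
    n_examples = len(activations)
    kept = []
    for j in range(len(activations[0])):
        fired = sum(1 for row in activations if row[j] > threshold)
        if fired / n_examples >= min_recall:
            kept.append(j)
    return kept


def filter_by_intervention(candidates: Sequence[int],
                           effect: Dict[int, float],
                           min_effect: float = 0.0) -> List[int]:
    """Interventional filtering (assumed form): keep candidates whose
    ablation changed the task metric by more than `min_effect`.
    `effect` is a hypothetical table measured in a separate experiment."""
    return [j for j in candidates if effect.get(j, 0.0) > min_effect]


def select_resonant(activations: Sequence[Sequence[float]],
                    task_features: Sequence[int],
                    keep_fraction: float = 0.5) -> List[int]:
    """Return indices of the examples with the highest mean activation
    over `task_features`, keeping `keep_fraction` of the pool."""
    if not task_features:
        raise ValueError("task_features must not be empty")
    scores = [sum(row[j] for j in task_features) / len(task_features)
              for row in activations]
    n_keep = max(1, int(len(activations) * keep_fraction))
    ranked = sorted(range(len(activations)),
                    key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy data (invented for illustration only, not from the paper).
task_acts = [[0.9, 0.0, 0.2], [0.7, 0.0, 0.0], [0.8, 0.1, 0.3]]
candidates = frequent_features(task_acts, min_recall=0.6)
task_features = filter_by_intervention(candidates, {0: 0.12, 2: 0.0})
pool = [[0.1, 0.5, 0.0], [0.9, 0.0, 0.0], [0.4, 0.2, 0.1], [0.8, 0.3, 0.0]]
selected = select_resonant(pool, task_features, keep_fraction=0.5)
```

With `keep_fraction=0.5` the sketch mirrors the 50% data budget reported above. In a real pipeline the activation rows would come from an SAE attached to the model, and the effect table from ablating or steering each candidate feature.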

Problem

Research questions and friction points this paper is trying to address.

interpretability

data selection

large language models

feature activation

model optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpretability-Guided Data Selection

Sparse Autoencoders

Feature-Resonant Data