From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models

📅 2026-04-27

🤖 AI Summary
This work addresses the challenge of translating interpretable features uncovered by internal interpretability tools, such as sparse autoencoders, into performance-enhancing training strategies. To this end, it introduces Interpretability-Guided Data Selection (IGDS), a framework that establishes, for the first time, a direct pathway from internal interpretable features to data selection. IGDS identifies task-relevant internal features through frequency recall and interventional filtering, then selects "feature-resonant" data that strongly activates these features for fine-tuning. Experiments on Gemma-2, LLaMA-3.1, and Qwen3 show that IGDS achieves superior performance using only 50% of the training data: on mathematical reasoning, Gemma-2-2B fine-tuned with IGDS outperforms full-data fine-tuning by 17.4% and also surpasses existing baselines based on data quality and diversity.
📝 Abstract
While mechanistic interpretability tools like Sparse Autoencoders (SAEs) can uncover meaningful features within Large Language Models (LLMs), a critical gap remains in transforming these insights into practical actions for model optimization. We bridge this gap with the hypothesis that data selection guided by a model's internal task features is an effective training strategy. Building on this hypothesis, we propose Interpretability-Guided Data Selection (IGDS), a framework that first identifies these causal task features through frequency recall and interventional filtering, then selects "Feature-Resonant Data" that maximally activates task features for fine-tuning. We validate IGDS on mathematical reasoning, summarization, and translation tasks with Gemma-2, LLaMA-3.1, and Qwen3 models. Our experiments demonstrate exceptional data efficiency: on the Math task, IGDS surpasses full-dataset fine-tuning by a remarkable 17.4% on Gemma-2-2B while using only 50% of the data, and outperforms established baselines focused on data quality and diversity. Analysis confirms a strong positive correlation between feature amplification and task performance improvement. IGDS thus provides a direct and effective framework to enhance LLMs by leveraging their internal mechanisms, validating our core hypothesis.
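The selection step described in the abstract (keep the data that most strongly activates the identified task features) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the scoring rule, so the mean-activation score, the function names `resonance_score` and `select_feature_resonant`, the `keep_fraction` parameter, and the precomputed activation table are all assumptions made for illustration. Feature identification (frequency recall and interventional filtering) is taken as already done and is not shown.

```python
from typing import List, Sequence


def resonance_score(activations: Sequence[float], task_features: Sequence[int]) -> float:
    """Mean activation of one example over the task features (assumed scoring rule)."""
    return sum(activations[i] for i in task_features) / len(task_features)


def select_feature_resonant(
    feature_activations: List[Sequence[float]],
    task_features: Sequence[int],
    keep_fraction: float = 0.5,
) -> List[int]:
    """Return the indices of the examples with the highest resonance scores.

    feature_activations[k][j] is a hypothetical, precomputed activation of
    SAE feature j on training example k (for instance, pooled over tokens).
    """
    if not task_features:
        raise ValueError("task_features must be non-empty")
    n_keep = max(1, int(len(feature_activations) * keep_fraction))
    scores = [resonance_score(row, task_features) for row in feature_activations]
    # Rank example indices by score, highest first; ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda k: -scores[k])
    # Return the kept indices in dataset order.
    return sorted(ranked[:n_keep])


# Toy data: 4 examples, 3 SAE features; features 0 and 2 play the role of task features.
toy_activations = [
    [0.9, 0.1, 0.8],
    [0.0, 0.7, 0.1],
    [0.5, 0.0, 0.6],
    [0.1, 0.9, 0.0],
]
selected = select_feature_resonant(toy_activations, task_features=[0, 2], keep_fraction=0.5)
print(selected)  # [0, 2]
```

With `keep_fraction=0.5` the sketch mirrors the 50% data budget reported in the abstract; in practice the activations would come from running the model and its SAE over the candidate training set.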
Problem

Research questions and friction points this paper is trying to address.

interpretability
data selection
large language models
feature activation
model optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interpretability-Guided Data Selection
Sparse Autoencoders
Feature-Resonant Data
Mechanistic Interpretability
Data Efficiency