ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs

📅 2025-05-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM feature interaction attribution methods either enumerate all feature combinations up to a given order or, like SPEX, require tens of thousands of model inferences, making them impractical for high-dimensional inputs with thousands of features. This work observes that LLM feature interactions are often hierarchical: higher-order interactions tend to appear alongside their lower-order subsets. Exploiting this structure, the authors propose ProxySPEX, an attribution method that fits a gradient-boosted tree surrogate to masked LLM outputs and then extracts the important interactions from the fitted trees. ProxySPEX uses 10× fewer inferences than SPEX, reconstructs LLM outputs 20% more faithfully than marginal attribution approaches, and identifies features that influence model output over 20% more than those selected by marginal methods. Applied to data attribution (interactions among CIFAR-10 training samples) and mechanistic interpretability (interactions between attention heads within and across layers), it enables more aggressive and robust attention-head pruning, combining scalability with high attribution precision.

📝 Abstract
Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs $n$. Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach that uses interaction sparsity to scale to $n \approx 10^3$ features. SPEX greatly improves upon prior methods but requires tens of thousands of model inferences, which can be prohibitive for large models. In this paper, we observe that LLM feature interactions are often hierarchical -- higher-order interactions are accompanied by their lower-order subsets -- which enables more efficient discovery. To exploit this hierarchy, we propose ProxySPEX, an interaction attribution algorithm that first fits gradient boosted trees to masked LLM outputs and then extracts the important interactions. Experiments across four challenging high-dimensional datasets show that ProxySPEX more faithfully reconstructs LLM outputs by 20% over marginal attribution approaches while using $10\times$ fewer inferences than SPEX. By accounting for interactions, ProxySPEX identifies features that influence model output over 20% more than those selected by marginal approaches. Further, we apply ProxySPEX to two interpretability tasks: data attribution, where we identify interactions among CIFAR-10 training samples that influence test predictions, and mechanistic interpretability, where we uncover interactions between attention heads, both within and across layers, on a question-answering task. ProxySPEX identifies interactions that enable more aggressive pruning of heads than marginal approaches.
Problem

Research questions and friction points this paper is trying to address.

Efficiently identifying hierarchical feature interactions in LLMs
Reducing inference costs for interpretability in large models
Improving accuracy of feature attribution in high-dimensional datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses gradient boosted trees for interaction attribution
Exploits hierarchical feature interactions in LLMs
Reduces the number of model inferences by 10× relative to SPEX
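The surrogate idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (toy model, variable names, and the tree-based interaction extraction heuristic are assumptions for exposition, not the authors' implementation): sample random binary masks over the input features, query a model on each masked input, fit a gradient-boosted tree regressor to the masked outputs, and then read off candidate interactions as features that co-occur within the same tree, a coarse proxy for co-occurrence along root-to-leaf paths.

```python
# Illustrative sketch of a ProxySPEX-style pipeline (assumptions, not the paper's code):
# a toy function stands in for masked-LLM outputs, and interactions are read off
# from features that co-occur within individual boosted trees.
import itertools
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_features = 8

def toy_llm_output(mask):
    # Stand-in for an LLM evaluated on a masked input:
    # features 0 and 1 interact; feature 2 acts marginally.
    return 2.0 * mask[0] * mask[1] + 1.0 * mask[2]

# Sample random binary masks and query the (toy) model once per mask.
masks = rng.integers(0, 2, size=(2000, n_features)).astype(float)
y = np.array([toy_llm_output(m) for m in masks])

# Fit the gradient-boosted tree surrogate to the masked outputs.
surrogate = GradientBoostingRegressor(
    n_estimators=100, max_depth=3, random_state=0
)
surrogate.fit(masks, y)

def tree_cooccurrence_counts(model):
    """Count how often each feature pair appears in the same boosted tree."""
    counts = {}
    for est in model.estimators_.ravel():
        tree = est.tree_
        # tree.feature is -2 at leaves; keep only internal split features.
        feats = sorted({f for f in tree.feature if f >= 0})
        for pair in itertools.combinations(feats, 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts

counts = tree_cooccurrence_counts(surrogate)
print("surrogate R^2:", round(surrogate.score(masks, y), 3))
print("(0, 1) co-occurrence count:", counts.get((0, 1), 0))
```

In this toy setup the interacting pair (0, 1) is repeatedly split on within the same trees, while noise features (3 through 7) rarely appear, which is the hierarchical signal a method like ProxySPEX exploits; the real algorithm operates on actual LLM outputs and a more refined interaction-extraction step.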