SPEX: Scaling Feature Interaction Explanations for LLMs

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing interaction attribution methods (e.g., extensions of SHAP) scale poorly beyond small input lengths, limiting the interpretability of large language models (LLMs) on realistic long-context inputs (~1000 tokens). Method: We propose a model-agnostic, scalable interaction attribution framework for inputs of up to thousands of tokens. It exploits the natural sparsity of interactions in real-world data, combining a sparse Fourier transform with channel decoding techniques to identify important feature interactions efficiently and with high fidelity. The framework supports black-box LLMs and multimodal models, enabling attribution for abstract and compositional reasoning. Contribution/Results: On long-context benchmarks, the method reconstructs LLM outputs up to 20% more faithfully than marginal attribution baselines. Its interaction attributions correlate strongly with human annotations on HotpotQA (Spearman ρ > 0.89). We demonstrate broad applicability across GPT-4o mini and vision-language models, validating generalizability and practical utility.

📝 Abstract
Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide marginal feature attributions, while their extensions to interaction importances only scale to small input lengths ($\approx 20$). We propose Spectral Explainer (SPEX), a model-agnostic interaction attribution algorithm that efficiently scales to large input lengths ($\approx 1000$). SPEX exploits underlying natural sparsity among interactions -- common in real-world data -- and applies a sparse Fourier transform using a channel decoding algorithm to efficiently identify important interactions. We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20% in terms of faithfully reconstructing LLM outputs. Further, SPEX successfully identifies key features and interactions that strongly influence model output. For one of our datasets, HotpotQA, SPEX provides interactions that align with human annotations. Finally, we use our model-agnostic approach to generate explanations to demonstrate abstract reasoning in closed-source LLMs (GPT-4o mini) and compositional reasoning in vision-language models.
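Because SPEX is model-agnostic, it only needs black-box evaluations of the model on masked versions of the input. A minimal sketch of such a masking-based value function, assuming an illustrative `llm_score` callable and a `[MASK]` placeholder token (neither is part of SPEX's actual API):

```python
def value_function(tokens, mask, llm_score, mask_token="[MASK]"):
    """Model-agnostic black-box query: drop the masked-out tokens and
    score the resulting input. SPEX only needs such evaluations, which
    is why it also works for closed-source models like GPT-4o mini.

    `tokens` is the tokenized input; `mask[i] == 1` keeps token i.
    `llm_score` is a hypothetical callable returning a scalar output."""
    masked = [t if keep else mask_token for t, keep in zip(tokens, mask)]
    return llm_score(masked)

# Toy scorer (stand-in for an LLM): count the surviving tokens.
tokens = ["Paris", "is", "the", "capital", "of", "France"]
toy_score = lambda ts: float(len([t for t in ts if t != "[MASK]"]))
score = value_function(tokens, [1, 0, 1, 1, 0, 1], toy_score)
```

Each feature subset corresponds to one such query; SPEX's contribution is choosing a small, structured set of masks so that the important interactions can still be recovered.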
Problem

Research questions and friction points this paper is trying to address.

Scaling interaction attribution to large inputs
Identifying key features and interactions that influence model output
Providing model-agnostic explanations for complex reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Fourier transform
Channel decoding algorithm
Model-agnostic interaction attribution
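The sparse Fourier view treats the model's masked-input behavior as a Boolean function whose Fourier (Walsh-Hadamard) coefficients index feature subsets; large coefficients mark important interactions. A toy sketch using exact enumeration for illustration only (exact enumeration costs O(2^n) queries; SPEX instead recovers just the few large coefficients using channel-decoding techniques):

```python
import itertools

def walsh_hadamard_coeffs(f, n):
    """Exact Fourier (Walsh-Hadamard) coefficients of a Boolean-input
    function f over n features: coeff(S) = 2^-n * sum_m (-1)^<S,m> f(m).
    Each subset S of features gets one coefficient; its magnitude
    measures the importance of that interaction."""
    masks = list(itertools.product([0, 1], repeat=n))
    coeffs = {}
    for subset in masks:  # subsets are indexed the same way as masks
        total = 0.0
        for m in masks:
            parity = sum(s * b for s, b in zip(subset, m)) % 2
            total += (-1.0) ** parity * f(m)
        coeffs[subset] = total / len(masks)
    return coeffs

# Toy "model": the output depends on an interaction between
# features 0 and 1, and not at all on feature 2.
f = lambda m: 1.0 if (m[0] and m[1]) else 0.0
coeffs = walsh_hadamard_coeffs(f, 3)
```

Running this, every coefficient involving feature 2 is exactly zero, while the subsets over features 0 and 1 carry all the spectral mass: the interaction structure of `f` is sparse, which is precisely the structure SPEX exploits at scale.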
Justin Singh Kang
Department of Electrical Engineering and Computer Science, UC Berkeley
Landon Butler
EECS Ph.D. student, University of California, Berkeley
Machine Learning, Interpretability, Signal Processing, Game Theory
Abhineet Agarwal
Statistics PhD, University of California, Berkeley
Large Language Models, AI Explainability, Causal Inference, Bandits
Y. E. Erginbas
Department of Electrical Engineering and Computer Science, UC Berkeley
Ramtin Pedarsani
Associate Professor, Electrical and Computer Engineering, UC Santa Barbara
Machine Learning, Information Theory, Game Theory
K. Ramchandran
Department of Electrical Engineering and Computer Science, UC Berkeley
Bin Yu
Department of Electrical Engineering and Computer Science, UC Berkeley; Department of Statistics, UC Berkeley