🤖 AI Summary
Model-agnostic interpretability methods for black-box large language models (LLMs) incur prohibitive costs due to frequent API calls.
Method: This paper proposes a budget-aware surrogate-model-driven explanation framework that requires no access to the target LLM’s internal parameters; instead, it leverages low-cost surrogate models to generate high-fidelity explanations.
Contribution/Results: We empirically demonstrate, for the first time, that surrogate models can faithfully substitute for the original LLMs when generating explanations, and we systematically validate that these explanations generalize to downstream tasks such as reasoning diagnostics. Experiments across multiple mainstream LLMs show that our approach reduces API-call costs by 60–85% while preserving explanation fidelity and retaining ≥90% of the original performance on downstream tasks. This work establishes a new paradigm for budget-conscious, model-agnostic LLM interpretability.
📝 Abstract
As large language models (LLMs) become increasingly prevalent across applications, interpreting their predictions has become a critical challenge. Because LLMs vary in architecture and some are closed-source, model-agnostic techniques show great promise, as they require no access to a model's internal parameters. However, existing model-agnostic techniques must invoke the LLM many times to obtain enough samples for generating faithful explanations, which leads to high economic costs. In this paper, through a series of empirical studies, we show that it is practical to generate faithful explanations for large-scale LLMs by sampling from budget-friendly models instead. Moreover, we show that such proxy explanations also perform well on downstream tasks. Our analysis suggests a new paradigm for model-agnostic LLM explanation methods, one that incorporates information from budget-friendly models.
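To make the cost argument concrete, here is a minimal sketch (not the paper's actual code) of a standard model-agnostic technique, occlusion-based attribution, where every sample is scored by a cheap local surrogate instead of the expensive target LLM. The `surrogate_score` function is a hypothetical stand-in for any budget-friendly model; in the setting the paper studies, each call to the scoring function would otherwise be a paid API call to the target LLM.

```python
import math

def surrogate_score(tokens):
    # Hypothetical surrogate: a toy sentiment scorer standing in for a
    # small, cheaply queried local model. Returns a probability-like
    # score for the label of interest.
    positive = {"good", "great", "excellent"}
    negative = {"bad", "poor", "terrible"}
    raw = sum((t in positive) - (t in negative) for t in tokens)
    return 1.0 / (1.0 + math.exp(-raw))  # squash to (0, 1)

def occlusion_attributions(text, score_fn):
    # Attribute each token by the score drop observed when it is
    # removed. The number of score_fn calls grows with input length,
    # which is exactly why routing them to a surrogate saves cost.
    tokens = text.split()
    base = score_fn(tokens)
    return {
        tok: base - score_fn(tokens[:i] + tokens[i + 1:])
        for i, tok in enumerate(tokens)
    }

attrs = occlusion_attributions("the movie was great", surrogate_score)
```

Under this toy surrogate, removing "great" lowers the score while removing filler words does not, so the attribution correctly singles out the influential token. The paper's empirical question is whether such surrogate-scored explanations remain faithful to the original LLM's behavior.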