🤖 AI Summary
Existing interpretability methods for large language models (LLMs) in text classification struggle with black-box, high-query-cost LLM APIs, where internal gradients or attention mechanisms are inaccessible.
Method: This paper proposes a counterfactual reasoning–based keyword identification method centered on the Decision Change Rate (DCR) metric—a quantitative framework that systematically substitutes input tokens and measures resultant label shifts to estimate each token’s causal influence on model predictions. Unlike gradient- or attention-based white-box approaches, DCR requires only API-level access and imposes no assumptions about model architecture or parameter availability.
Contribution/Results: Evaluated across multiple text classification benchmarks, DCR achieves an average 12.7% improvement in keyword identification accuracy over baselines and demonstrates strong cross-model generalization. It establishes a novel, efficient, and non-intrusive paradigm for post-hoc interpretability analysis in resource-constrained, black-box LLM settings.
📝 Abstract
Large language models (LLMs) are becoming useful in many domains due to their impressive abilities that arise from large training datasets and large model sizes. More recently, they have been shown to be very effective in textual classification tasks, motivating the need to explain the LLMs' decisions. Motivated by practical constrains where LLMs are black-boxed and LLM calls are expensive, we study how incorporating counterfactuals into LLM reasoning can affect the LLM's ability to identify the top words that have contributed to its classification decision. To this end, we introduce a framework called the decision changing rate that helps us quantify the importance of the top words in classification. Our experimental results show that using counterfactuals can be helpful.