🤖 AI Summary
To address the challenge of attributing the outputs of large language models (LLMs) in context-grounded generative tasks such as summarization and question answering, this paper proposes MExGen, a multi-level perturbation-based interpretability framework tailored to generative models. It introduces scalarizers (e.g., ROUGE, BERTScore) that map discrete text outputs to real-valued scores suitable for attribution, and it favors attribution algorithms whose query complexity scales linearly, enabling efficient handling of long inputs. By combining hierarchical masking, perturbation analysis, and extensions of LIME and SHAP, and evaluating with both automated metrics and human studies, MExGen achieves significantly higher local fidelity than baseline methods. The results further show that multi-level explanations are more stable, more readable, and more trusted by human evaluators than single-level alternatives.
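The scalarizer idea can be illustrated with a minimal sketch: compare a perturbed model output against the original output and return a single real number. The paper considers scalarizers such as ROUGE and BERTScore; the unigram-F1 function below is a simple ROUGE-1-style stand-in, not the paper's implementation, and the function name is illustrative.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Scalarize `candidate` as its unigram F1 overlap with `reference`.

    A ROUGE-1-like proxy: any text-to-scalar map of this shape lets
    perturbation-based attribution treat a generative model like a
    scalar-output model.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `unigram_f1("the cat sat", "the cat sat")` returns `1.0`, while a perturbed output that drifts from the reference scores lower, which is exactly the drop that attribution methods measure.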
📝 Abstract
Perturbation-based explanation methods such as LIME and SHAP are commonly applied to text classification. This work focuses on their extension to generative language models. To address the challenges of text as output and long text inputs, we propose a general framework called MExGen that can be instantiated with different attribution algorithms. To handle text output, we introduce the notion of scalarizers for mapping text to real numbers and investigate multiple possibilities. To handle long inputs, we take a multi-level approach, proceeding from coarser levels of granularity to finer ones, and focus on algorithms with linear scaling in model queries. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and context-grounded question answering. The results show that our framework can provide more locally faithful explanations of generated outputs.
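The multi-level, linear-query idea in the abstract can be sketched as follows: first attribute at a coarse granularity (sentences) using one model query per masked unit, then refine only the highest-attribution units at a finer granularity (words). This is a hedged toy illustration, assuming a leave-one-out perturbation scheme and a caller-supplied `score` function (e.g., a scalarized model call); it is not the paper's exact algorithm or API.

```python
from typing import Callable, Dict, List, Tuple

def leave_one_out(units: List[str], score: Callable[[str], float]) -> List[float]:
    """Attribution of each unit = score drop when that unit is masked out.

    Uses len(units) + 1 calls to `score`, i.e., linear in the number of units.
    """
    full = score(" ".join(units))
    return [full - score(" ".join(u for j, u in enumerate(units) if j != i))
            for i in range(len(units))]

def multilevel_attribution(
    sentences: List[str],
    score: Callable[[str], float],
    refine_top: int = 1,
) -> Tuple[List[float], Dict[int, List[Tuple[str, float]]]]:
    """Coarse-to-fine pass: sentence-level scores, then word-level
    refinement of the `refine_top` highest-attribution sentences."""
    sent_scores = leave_one_out(sentences, score)
    order = sorted(range(len(sentences)), key=lambda i: -sent_scores[i])
    refined = {}
    for i in order[:refine_top]:
        words = sentences[i].split()
        refined[i] = list(zip(words, leave_one_out(words, score)))
    return sent_scores, refined
```

Because only the top-scoring sentences are expanded to the word level, total queries stay far below a full word-level pass over a long input, which is the practical point of the coarse-to-fine design.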