🤖 AI Summary
This work addresses the limitations of existing input attribution methods, which are typically model-agnostic and struggle to deliver high-fidelity explanations for decoder-only large language models. To this end, the authors propose Grad-ELLM, the first gradient-based attribution method specifically designed for decoder-only Transformers. Grad-ELLM generates step-wise saliency maps by integrating the importance of attention-layer gradient channels with the spatial significance of attention maps, all without requiring any architectural modifications to the model. Additionally, the study introduces π-Soft-NC and π-Soft-NS, two more equitable fidelity evaluation metrics. Experimental results across sentiment classification, question answering, and open-ended generation tasks demonstrate that Grad-ELLM significantly outperforms current approaches, achieving higher explanation fidelity.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their black-box nature raises concerns about transparency and faithfulness. Input attribution methods aim to highlight each input token's contribution to the model's output, but existing approaches are typically model-agnostic and do not account for transformer-specific architectures, leading to limited faithfulness. To address this, we propose Grad-ELLM, a gradient-based attribution method for decoder-only transformer-based LLMs. By aggregating channel importance from gradients of the output logit with respect to attention layers and spatial importance from attention maps, Grad-ELLM generates heatmaps at each generation step without requiring architectural modifications. Additionally, we introduce two faithfulness metrics, $\pi$-Soft-NC and $\pi$-Soft-NS, which modify Soft-NC/NS to provide fairer comparisons by controlling the amount of information kept when perturbing the text. We evaluate Grad-ELLM on sentiment classification, question answering, and open-ended generation tasks using different models. Experimental results show that Grad-ELLM consistently achieves higher faithfulness than other attribution methods.
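The aggregation described above (channel importance from logit gradients, combined with the spatial structure of attention maps) can be sketched in a Grad-CAM-like form. This is a minimal illustration, not the paper's implementation: the array shapes, the head-wise average pooling, the ReLU, and the query-axis reduction are all assumptions made for the sketch.

```python
import numpy as np

def saliency_sketch(attn_map, grad):
    """Hypothetical Grad-CAM-style aggregation for one attention layer.

    attn_map: (H, T, T) attention weights (heads x query x key) -- illustrative
    grad:     (H, T, T) gradient of the target output logit w.r.t. attn_map
    Returns a length-T saliency vector over input tokens, scaled to [0, 1].
    """
    # Channel importance: pool each head's gradients to a scalar weight
    head_weights = grad.mean(axis=(1, 2))                # (H,)
    # Weight each head's attention map by its gradient-derived importance
    weighted = head_weights[:, None, None] * attn_map    # (H, T, T)
    # Keep positively contributing evidence and sum over heads
    cam = np.maximum(weighted.sum(axis=0), 0.0)          # (T, T)
    # Per-token saliency: aggregate over query positions, then min-max scale
    sal = cam.mean(axis=0)
    span = sal.max() - sal.min()
    return (sal - sal.min()) / span if span > 0 else sal

# Toy example: 2 heads, 4 tokens
rng = np.random.default_rng(0)
A = rng.random((2, 4, 4))
G = rng.standard_normal((2, 4, 4))
heatmap = saliency_sketch(A, G)
```

In the step-wise setting the paper describes, such a map would be recomputed at each generation step from the gradient of that step's output logit.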