🤖 AI Summary
This work addresses the instability and unreliability of attribution explanations in large language models, which often arise from interference by redundant or merely correlated contextual information. To mitigate this issue, the paper proposes RISE, which introduces, for the first time, a redundancy-insensitive attribution scoring mechanism. RISE leverages conditional information analysis to quantify the unique contribution of each input token relative to the rest of the context, distinguishing essential signals from contextually related but non-essential redundancies. Experiments across multiple tasks show that RISE substantially improves the stability, robustness, and faithfulness of model attributions, and with them the interpretability and monitorability of large language model behavior.
📝 Abstract
Large language models (LLMs) generate outputs from extensive context that often contains redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, yet standard explanation methods struggle with redundant and overlapping context: minor input changes can cause unpredictable shifts in attribution scores, undermining interpretability and raising concerns about risks such as prompt injection. This work addresses the challenge of distinguishing essential context elements from merely correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input relative to the others, minimizing the impact of redundancy and yielding clearer, more stable attributions. Experiments demonstrate that RISE offers more robust explanations than traditional methods, underscoring the importance of conditional information for trustworthy LLM explanation and monitoring.
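The abstract does not spell out RISE's scoring formula. As a rough illustration of the general idea of a token's "unique influence relative to the others", the sketch below uses a leave-one-out ablation over a toy scoring function; the names `toy_model_score` and `conditional_attribution` are hypothetical stand-ins, not the paper's actual method. In the toy context, the signal word "paris" appears twice, so removing either copy alone changes nothing: each duplicate has zero unique contribution, while the unduplicated "capital" token retains full credit.

```python
def toy_model_score(context):
    """Toy stand-in for an LLM's confidence in some target answer.

    Returns 1.0 only if both signal words are present somewhere in the
    context; a duplicated signal word is therefore redundant.
    """
    return 1.0 if ("capital" in context and "paris" in context) else 0.0


def conditional_attribution(context, score_fn):
    """Unique contribution of each token, conditioned on the rest.

    Score drop when token i is ablated while all other tokens remain:
    a token fully covered by redundant copies scores 0.
    """
    full = score_fn(context)
    return {
        i: full - score_fn(context[:i] + context[i + 1:])
        for i in range(len(context))
    }


context = ["capital", "paris", "france", "paris"]  # "paris" is duplicated
attr = conditional_attribution(context, toy_model_score)
print(attr)  # {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}
```

A plain occlusion method that ablates each token against an empty baseline would credit both copies of "paris" equally; conditioning on the remaining context is what drives the redundant copies' scores to zero.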