CELL your Model: Contrastive Explanation Methods for Large Language Models

📅 2024-06-17

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of explaining the generative outputs of large language models (LLMs) under black-box access. It proposes what the authors describe as the first contrastive explanation methods that rely solely on black-box queries: by finding slightly modified prompts for which the model's response becomes less preferable or contradicts the original response, the methods identify which parts of the prompt account for a specific response. The approach tailors contrastive explanation to generative tasks, requires only a scoring function that is meaningful to the user rather than a class label, and includes a budgeted search algorithm that finds contrasts within a query limit, which is necessary for longer contexts. The methods are demonstrated on open-ended text generation, automated red teaming, and explaining conversational degradation.

📝 Abstract

The advent of black-box deep neural network classification models has sparked the need to explain their decisions. However, in the case of generative AI, such as large language models (LLMs), there is no class prediction to explain. Rather, one can ask why an LLM output a particular response to a given prompt. In this paper, we answer this question by proposing, to the best of our knowledge, the first contrastive explanation methods requiring only black-box/query access. Our explanations suggest that an LLM outputs a reply to a given prompt because, if the prompt were slightly modified, the LLM would have given a different response that is either less preferable or contradicts the original response. The key insight is that contrastive explanations simply require a scoring function that has meaning to the user and not necessarily a specific real-valued quantity (viz. class label). We offer two algorithms for finding contrastive explanations: (i) a myopic algorithm, which, although effective in creating contrasts, requires many model calls, and (ii) a budgeted algorithm, our main algorithmic contribution, which intelligently creates contrasts while adhering to a query budget, as is necessary for longer contexts. We show the efficacy of these methods on diverse natural language tasks such as open-text generation, automated red teaming, and explaining conversational degradation.
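
The idea of a contrast found under a query budget can be illustrated with a minimal Python sketch. This is not the paper's myopic or budgeted algorithm, whose details the abstract does not give; it is a simple first-hit search over single-word substitutions, written only to show the ingredients the abstract names: black-box model calls, a user-supplied scoring function, and a cap on queries. The names `find_contrast`, `toy_model`, `toy_score`, and the `substitutes` table are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


def find_contrast(
    prompt: str,
    model: Callable[[str], str],          # black-box: prompt in, response out
    score: Callable[[str, str], float],   # user-defined contrast score (assumed)
    substitutes: Dict[str, List[str]],    # hypothetical word -> alternatives table
    threshold: float,
    budget: int,
) -> Tuple[Optional[str], Optional[str], int]:
    """Return (contrastive_prompt, contrastive_response, queries_used).

    Tries one-word substitutions in order and stops at the first modified
    prompt whose response scores at or above `threshold`, or when the
    query budget is exhausted.
    """
    if budget < 1:
        return None, None, 0
    original_response = model(prompt)  # the response being explained
    queries = 1
    words = prompt.split()
    for i, word in enumerate(words):
        for alt in substitutes.get(word, []):
            if queries >= budget:
                return None, None, queries  # out of budget, no contrast found
            candidate = " ".join(words[:i] + [alt] + words[i + 1:])
            response = model(candidate)
            queries += 1
            if score(original_response, response) >= threshold:
                return candidate, response, queries
    return None, None, queries


# Toy stand-ins for an LLM and a scoring function (illustrative only).
def toy_model(prompt: str) -> str:
    return "take an umbrella" if "rainy" in prompt else "no umbrella needed"


def toy_score(original: str, new: str) -> float:
    return 0.0 if original == new else 1.0


subs = {"rainy": ["sunny"], "today": ["tomorrow"]}
result = find_contrast("it is rainy today", toy_model, toy_score, subs, 0.5, 10)
starved = find_contrast("it is rainy today", toy_model, toy_score, subs, 0.5, 1)
```

In this toy run, changing "rainy" to "sunny" flips the response, so the explanation reads: the model advised an umbrella because the prompt said "rainy". With a budget of one query, only the original prompt can be evaluated and no contrast is returned, which hints at why the choice of which modifications to try matters when the budget is tight.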

Problem

Research questions and friction points this paper is trying to address.

Explain LLM responses contrastively

Develop budgeted contrastive explanation algorithm

Apply method to open-text generation tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive explanation method

Black-box query access

Budgeted algorithm with a user-meaningful scoring function

🔎 Similar Papers

FaithLM: Towards Faithful Explanations for Large Language Models