🤖 AI Summary
This work addresses an ambiguity in feature attribution for generative language models: it is often unclear what should count as a feature, and attribution claims rarely state their assumptions explicitly. The paper introduces the Attribution Contract, a specification that names the output being explained, the features eligible to receive attribution, the generative process assumed, what is held fixed, and the model score being attributed. It argues that many disagreements about attribution stem from unstated explanatory contracts rather than from the attribution algorithms themselves. Using autoregressive and diffusion language models as case studies, it shows when attribution to earlier generated tokens, intermediate states, or denoising stages is informative and when it is misleading, and it argues that attribution methods should be evaluated as method-contract pairs.
📝 Abstract
Feature attribution methods promise to identify which input features matter for a model output. In generative language models, however, it is often unclear what should count as a feature in the first place. In autoregressive language models, earlier generated tokens are both outputs of the model and inputs to later predictions. In diffusion language models, generation proceeds through iterative denoising or unmasking rather than fixed left-to-right prediction, so local explanation may target a state of diffusion rather than a next token. We argue that this ambiguity is not merely an implementation detail, but a conceptual limitation of carrying classifier-era feature attribution directly into generative language modeling. We introduce the Attribution Contract, a specification for feature-attribution claims that names what output is being explained, which features are eligible to receive attribution, what generative process is assumed, what is held fixed, and what model score is being attributed. The contract clarifies why the same attribution method can answer different questions depending on how it is instantiated. We argue that many disagreements about feature attribution in generative language models are not disagreements about attribution algorithms, but about unstated explanatory contracts. Using autoregressive and diffusion language models as case studies, we show when attribution to earlier generated tokens, intermediate states, or denoising stages is informative, when it is misleading, and why feature-attribution methods in generative language models should be evaluated as method-contract pairs.
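One way to picture the five items the contract names is as a small record type. The sketch below is only an illustration under stated assumptions: the class name, field names, helper function, and example values are all hypothetical and are not taken from the paper. It shows how two attribution claims about the same autoregressive model can differ in their contract even if the attribution algorithm is unchanged.

```python
from dataclasses import dataclass, fields, replace
from typing import List, Tuple


@dataclass(frozen=True)
class AttributionContract:
    """Hypothetical record of the five items the abstract lists."""
    explained_output: str                # what output is being explained
    eligible_features: Tuple[str, ...]   # which features may receive attribution
    generative_process: str              # what generative process is assumed
    held_fixed: Tuple[str, ...]          # what is held fixed
    model_score: str                     # what model score is being attributed


def differing_fields(a: AttributionContract, b: AttributionContract) -> List[str]:
    """Return the names of the contract fields on which a and b disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Illustrative contract: only prompt tokens may receive attribution,
# and the earlier generated tokens are treated as fixed context.
prompt_only = AttributionContract(
    explained_output="token at position t",
    eligible_features=("prompt tokens",),
    generative_process="autoregressive, left-to-right",
    held_fixed=("earlier generated tokens",),
    model_score="log-probability of the token at position t",
)

# Illustrative variant: earlier generated tokens are also eligible features,
# so nothing is held fixed.
prompt_and_prefix = replace(
    prompt_only,
    eligible_features=("prompt tokens", "earlier generated tokens"),
    held_fixed=(),
)

print(differing_fields(prompt_only, prompt_and_prefix))
```

In this toy setup the two contracts disagree on which features are eligible and on what is held fixed, so attribution scores produced under them would answer different questions, which is the sense in which the abstract speaks of method-contract pairs.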