Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

📅 2026-05-12

🤖 AI Summary
This work addresses fine-grained, token-level influence attribution for large language models deployed in high-stakes domains such as healthcare. The authors propose a general framework based on latent mediation: sparse autoencoders are attached to any layer of a pretrained model to learn a basis of approximately independent latent features. Because influence over latent features does not decompose additively across tokens, the method uses Jacobian-vector products to compute it, propagates the latent attributions back to input tokens via token activation patterns, and scales with efficient inverse-Hessian approximations. Unlike prior influence-function methods, which are restricted to autoregressive settings and implicitly assume token independence, this approach yields joint, non-additive token-level attribution for general prediction tasks. Experiments on medical benchmarks show that the framework identifies sparse, interpretable sets of tokens that jointly influence predictions, supporting model transparency and auditing.
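For orientation, the classical influence function that this line of work builds on can be written as below. This is the standard textbook form, stated here as background; the summary does not give the paper's latent-space variant, so the symbols (loss L, fitted parameters \hat\theta, empirical Hessian H) are assumptions taken from the usual formulation and not from the paper itself.

```latex
\mathcal{I}(z_{\mathrm{train}}, z_{\mathrm{test}})
  = -\,\nabla_\theta L(z_{\mathrm{test}}, \hat\theta)^{\top}
      H_{\hat\theta}^{-1}\,
      \nabla_\theta L(z_{\mathrm{train}}, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```

The inverse Hessian is the expensive term, which is why practical methods approximate the product of H^{-1} with a vector instead of forming H^{-1} explicitly.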
📝 Abstract
A critical step for reliable use of large language models (LLMs) in healthcare is to attribute predictions to their training data, akin to a medical case study. This requires token-level precision: pinpointing not just which training examples influence a decision, but which tokens within them are responsible. While influence functions offer a principled framework for this, prior work is restricted to autoregressive settings and relies on an implicit assumption of token independence, which makes the influences it identifies unreliable. We introduce a flexible framework that infers token-level influence through a latent mediation approach for general prediction tasks. Our method attaches sparse autoencoders to any layer of a pretrained LLM to learn a basis of approximately independent latent features. Unlike prior methods where influence decomposes additively across tokens, influence computed over latent features is inherently non-decomposable. To address this, we introduce a novel method using Jacobian-vector products. Token-level influence is obtained by propagating latent attributions back to the input space via token activation patterns. We scale our approach using efficient inverse-Hessian approximations. Experiments on medical benchmarks show our approach identifies sparse, interpretable sets of tokens that jointly influence predictions. Our framework enhances trust and enables model auditing, generalizing to high-stakes domains that require transparent and accountable decisions.
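The abstract does not say which inverse-Hessian approximation is used. As one common option, the sketch below approximates an inverse-Hessian-vector product with conjugate gradient, using only Hessian-vector products, and then forms the standard influence score -g_test^T H^{-1} g_train. All names here (`conjugate_gradient`, `influence_score`, `toy_hvp`) and the 2x2 toy Hessian are hypothetical illustrations, not the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Return alpha * x + y, elementwise."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(
    hvp: Callable[[Vector], Vector],
    v: Vector,
    damping: float = 0.0,
    max_iters: int = 50,
    tol: float = 1e-10,
) -> Vector:
    """Approximately solve (H + damping * I) x = v.

    Only Hessian-vector products hvp(p) = H p are needed, so H is never
    formed or inverted. Assumes H + damping * I is symmetric positive definite.
    """
    x = [0.0] * len(v)
    r = list(v)            # residual v - A x, with x = 0
    p = list(r)            # search direction
    rs_old = dot(r, r)
    for _ in range(max_iters):
        if rs_old < tol:   # squared residual norm is small enough
            break
        Ap = axpy(damping, p, hvp(p))
        alpha = rs_old / dot(p, Ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs_old, p, r)
        rs_old = rs_new
    return x


def influence_score(
    hvp: Callable[[Vector], Vector],
    grad_train: Vector,
    grad_test: Vector,
) -> float:
    """Standard influence score: -grad_test^T H^{-1} grad_train."""
    ihvp = conjugate_gradient(hvp, grad_train)
    return -dot(grad_test, ihvp)


# Toy stand-in for a model Hessian (assumption: symmetric positive definite).
TOY_HESSIAN = [[2.0, 0.5], [0.5, 1.0]]


def toy_hvp(p: Vector) -> Vector:
    return [dot(row, p) for row in TOY_HESSIAN]


if __name__ == "__main__":
    # Influence of a training gradient [1, 0] on a test gradient [0, 1].
    print(influence_score(toy_hvp, [1.0, 0.0], [0.0, 1.0]))
```

In a real model, `toy_hvp` would be replaced by a Hessian-vector product obtained from automatic differentiation, and a positive `damping` value is often added when the Hessian is not positive definite.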
Problem

Research questions and friction points this paper is trying to address.

influence attribution
token-level precision
large language models
latent features
model interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

influence attribution
latent mediation
sparse autoencoders
token-level interpretability
orthogonal latent spaces