🤖 AI Summary
Auditing the interpretability of black-box large language models (LLMs) remains challenging because gradients, logits, and internal activations are inaccessible.
Method: We propose LAMP, a lightweight probing method that constructs local linear surrogate models solely from a model's self-generated explanations, without requiring access to model internals. LAMP treats the explanations as a coordinate system and performs local linear regression to quantify each explanation factor's actual influence on predictions and to assess explanation–prediction consistency.
Contribution/Results: Empirical evaluation across sentiment analysis, controversial topic detection, and safety prompt auditing demonstrates significant local linearity of LLM decision boundaries in explanation space. The fitted decision surfaces strongly correlate with human explanation quality ratings and clinical expert judgments (p < 0.01), establishing a novel, gradient-free paradigm for trustworthy black-box LLM auditing.
📝 Abstract
We introduce **LAMP** (**L**inear **A**ttribution **M**apping **P**robe), a method that shines light onto a black-box language model's decision surface and studies how reliably the model maps its stated reasons to its predictions through a locally linear approximation of that surface. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links those explanation coordinates to the model's output. In doing so, it reveals which stated factors steer the model's decisions, and by how much. We apply LAMP to three tasks: *sentiment analysis*, *controversial-topic detection*, and *safety-prompt auditing*. Across these tasks, LAMP reveals that many LLMs exhibit locally linear decision landscapes. Moreover, these surfaces correlate with human judgments of explanation quality and, on a clinical case-file dataset, align with expert assessments. Because LAMP operates without access to model gradients, logits, or internal activations, it serves as a practical, lightweight framework for auditing proprietary language models and for assessing whether a model behaves consistently with the explanations it provides.
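The core fit described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the prompting and perturbation scheme, the number of probes, and the factor weights are all hypothetical. Each row of `X` stands for the self-reported explanation-factor weights elicited for one perturbed input, and `y` for the corresponding black-box prediction score; a locally linear surrogate is then fit by ordinary least squares, with the coefficients estimating each stated factor's influence and R² measuring local linearity.

```python
import numpy as np

# Hypothetical probe data: in LAMP these would come from querying the
# black-box model, not from a random generator.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 3))   # 50 probes, 3 stated explanation factors
true_w = np.array([0.7, -0.2, 0.1])       # assumed ground-truth influences (for demo only)
y = X @ true_w + 0.05 * rng.normal(size=50)  # simulated prediction scores

# Fit the local linear surrogate by ordinary least squares (with intercept).
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Coefficients estimate each stated factor's actual influence on the
# prediction; R^2 quantifies how locally linear the decision surface is.
resid = y - A @ coef
r2 = 1.0 - resid.var() / y.var()
print(coef[:3])  # per-factor influence estimates
print(r2)        # close to 1 when the local surface is nearly linear
```

In an audit, a large gap between the fitted coefficients and the weights the model itself reports would flag an explanation–prediction inconsistency.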