LAMP: Extracting Locally Linear Decision Surfaces from LLM World Models

📅 2025-05-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Auditing black-box large language models (LLMs) remains challenging because gradients, logits, and internal activations are inaccessible. Method: We propose LAMP, a lightweight probing method that builds local linear surrogate models solely from the model's self-generated explanations, without requiring model internals. LAMP treats explanations as a coordinate system and performs local linear regression to quantify each explanation factor's actual influence on predictions and to assess explanation-prediction consistency. Contribution/Results: Empirical evaluation across sentiment analysis, controversial-topic detection, and safety-prompt auditing demonstrates significant local linearity of LLM decision boundaries in explanation space. The fitted decision surfaces correlate strongly with human explanation-quality ratings and clinical expert judgments (p < 0.01), establishing a novel, gradient-free paradigm for trustworthy black-box LLM auditing.

📝 Abstract
We introduce LAMP (Linear Attribution Mapping Probe), a method that shines light on a black-box language model's decision surface and studies how reliably the model maps its stated reasons to its predictions via a locally linear approximation of that surface. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links the explanation weights to the model's output. In doing so, it reveals which stated factors steer the model's decisions, and by how much. We apply LAMP to three tasks: sentiment analysis, controversial-topic detection, and safety-prompt auditing. Across these tasks, LAMP reveals that many LLMs exhibit locally linear decision landscapes. Moreover, these surfaces correlate with human judgments of explanation quality and, on a clinical case-file dataset, align with expert assessments. Since LAMP operates without access to model gradients, logits, or internal activations, it serves as a practical, lightweight framework for auditing proprietary language models and for assessing whether a model behaves consistently with the explanations it provides.
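The core fitting step described above can be sketched as an ordinary least-squares fit of the model's predictions onto its self-reported explanation weights. This is a minimal illustration of the idea, not the authors' implementation: the function name, data shapes, and toy data below are all assumptions.

```python
import numpy as np

def fit_local_surrogate(explanation_weights, predictions):
    """Fit a local linear surrogate in explanation space (illustrative sketch).

    explanation_weights : (n_samples, n_factors) array of the model's
        self-reported importance scores for each stated factor.
    predictions : (n_samples,) array of the model's scalar outputs
        (e.g. sentiment scores) for perturbed versions of one prompt.
    Returns per-factor coefficients and the R^2 of the fit; a high R^2
    indicates a locally linear decision surface in explanation space.
    """
    # Append an intercept column and solve the least-squares problem.
    X = np.column_stack([explanation_weights, np.ones(len(predictions))])
    coefs, *_ = np.linalg.lstsq(X, predictions, rcond=None)
    residuals = predictions - X @ coefs
    ss_res = float(residuals @ residuals)
    ss_tot = float(((predictions - predictions.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return coefs[:-1], r2

# Toy usage: synthetic predictions that are exactly linear in two stated
# factors, so the recovered coefficients match the generating weights.
rng = np.random.default_rng(0)
W = rng.uniform(0.0, 1.0, size=(20, 2))
y = 0.8 * W[:, 0] - 0.3 * W[:, 1] + 0.05
betas, r2 = fit_local_surrogate(W, y)
```

The fitted `betas` quantify how much each stated factor steers the prediction, and `r2` measures explanation-prediction consistency for that local neighborhood.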
Problem

Research questions and friction points this paper is trying to address.

How can the decision surfaces of black-box LLMs be characterized without gradients, logits, or internal activations?
Do a model's self-reported explanations actually drive its predictions?
How can proprietary models be audited without internal access?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts locally linear decision surfaces from LLMs
Uses self-reported explanations as a coordinate system
Operates without model gradients or internal activations
Ryan Chen
Northwestern University
Youngmin Ko
Department of Statistics and Data Science, Northwestern University
Zeyu Zhang
Department of Statistics and Data Science, Northwestern University
Catherine Cho
Department of Statistics and Data Science, Northwestern University
Sunny Chung
Yale School of Medicine
implementation science · gastroenterology · artificial intelligence
Mauro Giuffrè
Section of Digestive Diseases, Yale School of Medicine
Dennis L. Shung
Yale University School of Medicine
Bradly C. Stadie
Department of Statistics and Data Science, Northwestern University