🤖 AI Summary
Auditing the interpretability of black-box large language models (LLMs) remains challenging because gradients, logits, and internal activations are inaccessible.
Method: We propose LAMP, a lightweight probing method that constructs local linear surrogate models solely from a model's self-generated explanations, without requiring access to model internals. LAMP treats the explanations as a coordinate system and performs local linear regression to quantify each explanation factor's actual influence on predictions and to assess explanation–prediction consistency.
Contribution/Results: Empirical evaluation across sentiment analysis, controversial topic detection, and safety prompt auditing demonstrates significant local linearity of LLM decision boundaries in explanation space. The fitted decision surfaces strongly correlate with human explanation quality ratings and clinical expert judgments (p < 0.01), establishing a novel, gradient-free paradigm for trustworthy black-box LLM auditing.
📝 Abstract
We introduce **LAMP** (**L**inear **A**ttribution **M**apping **P**robe), a method that shines light onto a black-box language model's decision surface and studies how reliably the model maps its stated reasons to its predictions through a locally linear approximation of that surface. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links those explanation coordinates to the model's output. In doing so, it reveals which stated factors steer the model's decisions, and by how much. We apply LAMP to three tasks: *sentiment analysis*, *controversial-topic detection*, and *safety-prompt auditing*. Across these tasks, LAMP reveals that many LLMs exhibit locally linear decision landscapes. Moreover, these surfaces correlate with human judgments of explanation quality and, on a clinical case-file dataset, align with expert assessments. Because LAMP operates without access to model gradients, logits, or internal activations, it serves as a practical, lightweight framework for auditing proprietary language models and for assessing whether a model behaves consistently with the explanations it provides.
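The core fit described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the prompting and perturbation scheme, the number of probes, and the factor weights are all hypothetical. Each row of `X` stands for the self-reported explanation-factor weights elicited for one perturbed input, and `y` for the corresponding black-box prediction score; a locally linear surrogate is then fit by ordinary least squares, with the coefficients estimating each stated factor's influence and R² measuring local linearity.

```python
import numpy as np

# Hypothetical probe data: in LAMP these would come from querying the
# black-box model, not from a random generator.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 3))   # 50 probes, 3 stated explanation factors
true_w = np.array([0.7, -0.2, 0.1])       # assumed ground-truth influences (for demo only)
y = X @ true_w + 0.05 * rng.normal(size=50)  # simulated prediction scores

# Fit the local linear surrogate by ordinary least squares (with intercept).
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Coefficients estimate each stated factor's actual influence on the
# prediction; R^2 quantifies how locally linear the decision surface is.
resid = y - A @ coef
r2 = 1.0 - resid.var() / y.var()
print(coef[:3])  # per-factor influence estimates
print(r2)        # close to 1 when the local surface is nearly linear
```

In an audit, a large gap between the fitted coefficients and the weights the model itself reports would flag an explanation–prediction inconsistency.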