Because we have LLMs, we Can and Should Pursue Agentic Interpretability

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current interpretability methods produce static, one-off explanations that ignore the user's evolving understanding, leaving humans increasingly unable to follow what LLMs know. Method: The paper proposes *agentic interpretability*--a paradigm in which the LLM acts as a proactive teacher: across a multi-turn conversation it builds and leverages a mental model of the user (a theory-of-mind framing) and adapts its explanations so that the human, in turn, develops a better mental model of the LLM. Contribution/Results: As a position paper, it defines the paradigm and contrasts it with traditional "inspective" (open-the-black-box) methods; it notes the trade-off of completeness for interactivity, which makes the approach less suitable for high-stakes safety settings with potentially deceptive models; and it analyzes the central evaluation difficulty, the "human-entangled-in-the-loop" property (human responses are an integral part of the algorithm), proposing possible solutions and proxy goals. The aim is to help humans learn potentially superhuman concepts from cooperative LLMs, a foundation for higher-trust human-AI collaboration.
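
The paper is a position piece and prescribes no concrete implementation, but the dialogue loop it envisions is easy to picture. Below is a minimal Python sketch under our own assumptions: `llm_chat` is a hypothetical placeholder for any chat-completion API, and the system prompt and loop structure are illustrative only, not an architecture from the paper.

```python
# Minimal sketch of an agentic-interpretability session.
# `llm_chat` is a hypothetical placeholder for a chat-completion API;
# nothing here is prescribed by the paper itself.

def llm_chat(messages: list[dict]) -> str:
    """Placeholder: call a chat-completion API of your choice here."""
    raise NotImplementedError

def agentic_explanation_session(concept: str, max_turns: int = 5) -> None:
    # The system prompt asks the model to behave like a teacher that
    # maintains a mental model of the user (the paper's theory-of-mind
    # framing) instead of emitting one static explanation.
    messages = [{
        "role": "system",
        "content": (
            f"You are teaching the concept '{concept}'. Track what the "
            "user seems to understand or misunderstand, ask clarifying "
            "questions, and adapt each explanation to their current level."
        ),
    }]
    for _ in range(max_turns):
        reply = llm_chat(messages)
        print(f"LLM: {reply}")
        messages.append({"role": "assistant", "content": reply})
        # The human's reply feeds back into the next turn -- the
        # 'human-entangled-in-the-loop' property: the human is part of
        # the algorithm, not just a consumer of its output.
        user_input = input("You: ")
        if user_input.strip().lower() == "done":
            break
        messages.append({"role": "user", "content": user_input})
```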

📝 Abstract
The era of Large Language Models (LLMs) presents a new opportunity for interpretability--agentic interpretability: a multi-turn conversation with an LLM wherein the LLM proactively assists human understanding by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM. Such conversations are a new capability that traditional `inspective' interpretability methods (opening the black box) do not use. Having a language model that aims to teach and explain--beyond just knowing how to talk--is similar to a teacher whose goal is to teach well, understanding that their success will be measured by the student's comprehension. While agentic interpretability may trade off completeness for interactivity, making it less suitable for high-stakes safety situations with potentially deceptive models, it leverages a cooperative model to discover potentially superhuman concepts that can improve humans' mental model of machines. Agentic interpretability introduces challenges, particularly in evaluation, due to what we call its `human-entangled-in-the-loop' nature (human responses are an integral part of the algorithm), which makes design and evaluation difficult. We discuss possible solutions and proxy goals. As LLMs approach human parity in many tasks, agentic interpretability's promise is to help humans learn the potentially superhuman concepts of the LLMs, rather than watch us fall increasingly far behind in understanding them.
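
The abstract's closing point about proxy goals suggests one concrete form an evaluation could take: test whether the human can predict the model's behavior on held-out probes before and after the conversation. The sketch below is our own assumption, not a protocol from the paper; `human_predicts` and `model_answers` are hypothetical callables introduced purely for illustration.

```python
# Hedged sketch of a proxy evaluation: did the dialogue improve the
# human's mental model of the LLM? Measured here as the human's ability
# to predict the model's outputs on held-out probe inputs.
# `human_predicts` and `model_answers` are hypothetical callables.

from typing import Callable

def prediction_accuracy(
    human_predicts: Callable[[str], str],
    model_answers: Callable[[str], str],
    probes: list[str],
) -> float:
    """Fraction of probes where the human anticipates the model's output."""
    hits = sum(human_predicts(p) == model_answers(p) for p in probes)
    return hits / len(probes)

# Usage: bracket an agentic-interpretability session with two measurements.
# acc_before = prediction_accuracy(human_predicts, model_answers, probes)
# ... run the multi-turn explanation session ...
# acc_after = prediction_accuracy(human_predicts, model_answers, probes)
# The gain (acc_after - acc_before) serves as a proxy for how much the
# human's mental model of the machine improved.
```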
Problem

Research questions and friction points this paper is trying to address.

Develop interactive LLM interpretability via multi-turn conversations
Enhance human understanding of superhuman LLM concepts
Address evaluation challenges in human-entangled interpretability methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM proactively assists human understanding
Multi-turn conversation for interpretability
Identification of 'human-entangled-in-the-loop' evaluation challenges and proxy goals