Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for interpreting LLM activations rely on complex, task-specific techniques, which limits generalizability across models and tasks. Method: This paper proposes a unified evaluation framework for "Activation Oracles" (AOs), LLMs trained via LatentQA to answer natural-language questions about activations, and systematically investigates their out-of-distribution generalization and the impact of training-data diversity. The AOs are trained as a single natural-language interpreter on diversified LatentQA data, including classification tasks and a self-supervised context-prediction task. Contribution/Results: AOs match or surpass prior white-box methods on four downstream explanation tasks, achieving the best results on three. Crucially, the paper demonstrates for the first time that AOs can generalize to unseen fine-tuned models and recover latent knowledge (e.g., biographical facts or malicious propensities), moving beyond narrow, task-bound explanation paradigms toward model-agnostic interpretability.

📝 Abstract
Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.
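The core LatentQA idea in the abstract, feeding a target model's activations directly to an explainer LLM alongside a natural-language question, can be sketched minimally: project the activation into the oracle's embedding space and prepend it as a "soft token" before the question's token embeddings. All names, dimensions, and the linear-adapter choice below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: target-model hidden size, oracle embedding
# size, and number of question tokens (illustrative values only).
D_TARGET, D_ORACLE, N_PROMPT = 16, 8, 4

# Assumed adapter: a learned linear projection mapping target activations
# into the oracle's embedding space. The paper's adapter may differ.
W_proj = rng.standard_normal((D_TARGET, D_ORACLE))

def build_oracle_input(activation, prompt_embeddings):
    """Prepend the projected activation as a soft token before the
    embedded question, so the oracle LLM can attend to it like any
    other input position."""
    soft_token = activation @ W_proj            # shape (D_ORACLE,)
    return np.vstack([soft_token, prompt_embeddings])

activation = rng.standard_normal(D_TARGET)           # one residual-stream vector
prompt = rng.standard_normal((N_PROMPT, D_ORACLE))   # embedded question tokens
seq = build_oracle_input(activation, prompt)
print(seq.shape)  # (5, 8): 1 soft token + 4 question tokens
```

In the actual method the oracle is a full LLM fine-tuned end to end on LatentQA data, so the projection and the oracle's weights would be trained jointly; this sketch only shows how an activation can enter the input sequence.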
Problem

Research questions and friction points this paper is trying to address.

Develop general-purpose models to interpret LLM activations via natural language
Evaluate performance of Activation Oracles in out-of-distribution settings
Assess how training data diversity improves activation explanation capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training LLMs to interpret activations via natural language queries
Evaluating generalization with diverse out-of-distribution tasks
Using diversified training datasets to enhance activation explanation performance