Do Natural Language Descriptions of Model Activations Convey Privileged Information?

📅 2025-09-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper questions whether existing “verbalizer LLM”-based activation interpretation methods genuinely uncover the internal representational mechanisms of target large language models (LLMs), or merely reproduce input information and the interpreter’s parametric priors. Method: The authors design controlled experiments and cross-dataset benchmarks to systematically disentangle three confounding factors: input semantics, target-model activation states, and interpreter-model biases. Contribution/Results: Mainstream methods pass standard evaluation benchmarks even when the target model’s activations are entirely ablated, indicating that their outputs stem predominantly from the interpreter’s parametric knowledge rather than from the target model’s internal representations. The paper thus identifies a critical lack of discriminative power in current evaluation datasets and introduces the first dedicated evaluation framework for attributing the provenance of explanations. This work provides both a methodological reflection and an empirical benchmark for interpretability research.

📝 Abstract
Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they succeed at benchmarks without any access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the activations of the target LLM being decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether activation verbalization reveals target-model internals
Assessing whether verbalizations reflect the target model's activations or the verbalizer's own knowledge
Identifying the need for controlled benchmarks in interpretability evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates verbalization methods with controlled experiments that ablate target-model activations
Shows that the verbalizer LLM's parametric knowledge, not the target's activations, often dominates the descriptions
Proposes targeted benchmarks for assessing whether verbalizations convey genuine activation information
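The ablation control at the heart of these experiments can be illustrated with a toy sketch (all function names and both verbalizers below are hypothetical stand-ins, not the paper's actual implementation): a verbalizer that ignores the target model's activations produces the same description whether or not those activations are ablated, while an activation-dependent verbalizer does not.

```python
import numpy as np

def verbalize_input_only(text, activations):
    # Hypothetical verbalizer that ignores activations entirely: its
    # "description" depends only on the input text (the failure mode
    # the paper probes for).
    return f"The model is processing: {text!r}"

def verbalize_activation_aware(text, activations):
    # Hypothetical verbalizer whose output actually depends on the
    # target activations.
    return f"Dominant activation index: {int(np.argmax(activations))}"

def passes_ablation_control(verbalizer, text, activations):
    # Ablate the target activations (here: zero them out) and check
    # whether the verbalization changes. If it does not, the
    # description cannot be conveying activation information.
    original = verbalizer(text, activations)
    ablated = verbalizer(text, np.zeros_like(activations))
    return original != ablated

acts = np.array([0.1, 2.5, -0.3, 0.7])
print(passes_ablation_control(verbalize_input_only, "Paris is in", acts))       # False
print(passes_ablation_control(verbalize_activation_aware, "Paris is in", acts)) # True
```

The sketch mirrors the paper's logic in miniature: passing a benchmark while failing this control indicates the benchmark does not discriminate between input-derived and activation-derived descriptions.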
🔎 Similar Papers
No similar papers found.
Millicent Li
Northeastern University
natural language processing, human-computer interaction, interpretability
Alberto Mario Ceballos Arroyo
Northeastern University
Giordano Rogers
Northeastern University
Naomi Saphra
Kempner Institute, Harvard University
Byron C. Wallace
Northeastern University