🤖 AI Summary
This paper questions whether existing “verbalizer LLM”-based activation interpretation methods genuinely uncover the internal representational mechanisms of target large language models (LLMs), or merely reproduce input information and the interpreter’s parametric priors.
Method: The authors design controlled experiments and cross-dataset benchmarks to systematically disentangle three confounding factors: input semantics, target-model activation states, and interpreter-model biases.
Contribution/Results: Key findings show that mainstream methods pass standard evaluation benchmarks even when target-model activations are entirely ablated—indicating that their outputs stem predominantly from the interpreter’s parametric knowledge rather than the target model’s private representations. Consequently, the paper identifies a critical lack of discriminative power in current evaluation datasets and introduces the first dedicated evaluation framework for attributing explanation provenance. This work provides both methodological reflection and an empirical benchmark for interpretability research.
📝 Abstract
Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they succeed at benchmarks without any access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the activations of the target LLM being decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
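The core control described above—checking whether a benchmark can even distinguish a verbalizer that reads the target model's activations from one that ignores them—can be illustrated with a toy sketch. This is a purely hypothetical simulation, not the paper's code: the verbalizer, benchmark, and data below are illustrative stand-ins.

```python
import random

def verbalize(input_text, activations):
    # Toy "verbalizer" exhibiting the failure mode the paper describes:
    # it produces a description from the input text (i.e., from prior /
    # parametric knowledge) and never consults the activations at all.
    return f"description of: {input_text}"

def benchmark_accuracy(inputs, activations_list):
    # Stand-in for an input-recoverability-style benchmark: score a hit
    # whenever the verbalization mentions the input's content.
    hits = 0
    for text, acts in zip(inputs, activations_list):
        hits += text in verbalize(text, acts)
    return hits / len(inputs)

inputs = ["animals", "weather", "sports"]
real_acts = [[random.random() for _ in range(8)] for _ in inputs]
ablated_acts = [[0.0] * 8 for _ in inputs]  # activations fully zeroed out

acc_real = benchmark_accuracy(inputs, real_acts)
acc_ablated = benchmark_accuracy(inputs, ablated_acts)

# If accuracy is identical under both conditions, the benchmark cannot
# tell whether the verbalizer used target-model internals at all.
print(acc_real, acc_ablated)  # → 1.0 1.0
```

A benchmark with real discriminative power would have to produce a measurable gap between the two conditions; the paper's finding is that the datasets used in prior work do not.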