🤖 AI Summary
This paper questions whether existing “verbalizer LLM”-based activation interpretation methods genuinely uncover the internal representational mechanisms of target large language models (LLMs), or merely reproduce input information and the interpreter’s parametric priors.
Method: The authors design controlled experiments and cross-dataset benchmarks to systematically disentangle three confounding factors: input semantics, target-model activation states, and interpreter-model biases.
Contribution/Results: Key findings show that mainstream methods pass standard evaluation benchmarks even when target-model activations are entirely ablated—indicating that their outputs stem predominantly from the interpreter’s parametric knowledge rather than the target model’s private representations. Consequently, the paper identifies a critical lack of discriminative power in current evaluation datasets and introduces the first dedicated evaluation framework for attributing explanation provenance. This work provides both methodological reflection and an empirical benchmark for interpretability research.
📝 Abstract
Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they succeed at benchmarks without any access to target model internals, suggesting that these datasets are not ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the activations of the target LLM being decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
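The core control described above—checking whether a benchmark can even distinguish a verbalizer that reads the target model's activations from one that ignores them—can be illustrated with a toy sketch. This is a purely hypothetical simulation, not the paper's code: the verbalizer, benchmark, and data below are illustrative stand-ins.

```python
import random

def verbalize(input_text, activations):
    # Toy "verbalizer" exhibiting the failure mode the paper describes:
    # it produces a description from the input text (i.e., from prior /
    # parametric knowledge) and never consults the activations at all.
    return f"description of: {input_text}"

def benchmark_accuracy(inputs, activations_list):
    # Stand-in for an input-recoverability-style benchmark: score a hit
    # whenever the verbalization mentions the input's content.
    hits = 0
    for text, acts in zip(inputs, activations_list):
        hits += text in verbalize(text, acts)
    return hits / len(inputs)

inputs = ["animals", "weather", "sports"]
real_acts = [[random.random() for _ in range(8)] for _ in inputs]
ablated_acts = [[0.0] * 8 for _ in inputs]  # activations fully zeroed out

acc_real = benchmark_accuracy(inputs, real_acts)
acc_ablated = benchmark_accuracy(inputs, ablated_acts)

# If accuracy is identical under both conditions, the benchmark cannot
# tell whether the verbalizer used target-model internals at all.
print(acc_real, acc_ablated)  # → 1.0 1.0
```

A benchmark with real discriminative power would have to produce a measurable gap between the two conditions; the paper's finding is that the datasets used in prior work do not.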