🤖 AI Summary
Existing activation verbalization methods are mostly limited to self-explanation, in which each model explains only its own activations, and they do not transfer across heterogeneous architectures. The paper proposes the Universal Activation Verbalizer (UAV), a framework that maps internal activations from donor models of different families and scales into natural language explanations through a shared decoder and lightweight trainable adapters. Each adapter converts donor activations into soft tokens in the decoder's embedding space. To support a new donor, UAV reuses a frozen decoder-side LoRA and trains only a new adapter. Across classification, fact retrieval, and gist summarization, UAV remains competitive with strong self-explanation baselines. Ablations indicate that decoder-side tuning mainly improves task behavior, while the adapter supplies the activation-grounded factual and semantic information needed for faithful explanations.
📝 Abstract
Activation verbalization explains hidden representations in natural language, but existing methods are mostly limited to self-explanation, where each model explains only its own activations. We introduce the Universal Activation Verbalizer (UAV), a framework that uses a shared decoder to explain activations from heterogeneous donor models. UAV learns a lightweight adapter that converts donor activations into soft tokens in the decoder's embedding space, and further supports adapter-only transfer by reusing a frozen decoder-side LoRA while training only a new adapter for another donor. Across classification, fact retrieval, and gist summarization, UAV remains competitive with strong self-explanation baselines while enabling cross-model verbalization across model families and scales. Ablations show that decoder-side tuning mainly improves task behavior, whereas the adapter provides the activation-grounded factual and semantic information needed for faithful explanations.
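The adapter step described above can be illustrated with a minimal sketch. The abstract does not specify the adapter architecture, so the single linear projection here is an assumption, as are all names (`init_adapter`, `to_soft_tokens`, `build_decoder_input`), the dimensions, and the number of soft tokens. The decoder itself is omitted, and random vectors stand in for real donor activations and prompt embeddings. The sketch only shows how donors with different hidden sizes could each get their own adapter while targeting one shared decoder embedding space.

```python
import numpy as np


def init_adapter(d_donor, d_dec, n_soft, rng):
    """Hypothetical linear adapter: one weight matrix and one bias vector."""
    scale = 1.0 / np.sqrt(d_donor)
    W = rng.normal(0.0, scale, size=(d_donor, n_soft * d_dec))
    b = np.zeros(n_soft * d_dec)
    return W, b


def to_soft_tokens(activation, W, b, n_soft, d_dec):
    """Map one donor activation of shape (d_donor,) to (n_soft, d_dec) soft tokens."""
    return (activation @ W + b).reshape(n_soft, d_dec)


def build_decoder_input(soft_tokens, prompt_embeddings):
    """Prepend the soft tokens to the embedded text prompt along the sequence axis."""
    return np.concatenate([soft_tokens, prompt_embeddings], axis=0)


rng = np.random.default_rng(0)
D_DEC, N_SOFT = 64, 4  # assumed decoder embedding size and soft-token count

# Two hypothetical donors with different hidden sizes, one adapter each.
donor_dims = {"donor_a": 48, "donor_b": 96}
adapters = {name: init_adapter(d, D_DEC, N_SOFT, rng) for name, d in donor_dims.items()}

# Stand-in for the embedded instruction prompt (5 tokens).
prompt = rng.normal(size=(5, D_DEC))

shapes = {}
for name, (W, b) in adapters.items():
    activation = rng.normal(size=donor_dims[name])  # stand-in donor activation
    soft = to_soft_tokens(activation, W, b, N_SOFT, D_DEC)
    inputs = build_decoder_input(soft, prompt)
    shapes[name] = (soft.shape, inputs.shape)
    print(name, soft.shape, inputs.shape)
```

In this sketch, adapter-only transfer would correspond to adding a new entry to `adapters` and training only its `W` and `b`, leaving the decoder and its LoRA weights untouched.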