Universal Activation Verbalizer: A Unified Framework for Cross-Model Activation Explanation

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing activation verbalization methods: they are mostly confined to self-explanation, where a model explains only its own activations, and do not transfer across heterogeneous architectures. The paper proposes UAV, a universal activation verbalization framework that maps internal activations from donor models of different families and scales into natural language explanations through a shared decoder and a lightweight trainable adapter. To support a new donor, UAV reuses a frozen decoder-side LoRA and trains only a new adapter. Across classification, fact retrieval, and gist summarization, UAV remains competitive with strong self-explanation baselines. Ablations indicate that decoder-side tuning mainly improves task behavior, while the adapter supplies the activation-grounded factual and semantic information needed for faithful explanations.

📝 Abstract

Activation verbalization explains hidden representations in natural language, but existing methods are mostly limited to self-explanation, where each model explains only its own activations. We introduce the Universal Activation Verbalizer (UAV), a framework that uses a shared decoder to explain activations from heterogeneous donor models. UAV learns a lightweight adapter that converts donor activations into soft tokens in the decoder's embedding space, and further supports adapter-only transfer by reusing a frozen decoder-side LoRA while training only a new adapter for another donor. Across classification, fact retrieval, and gist summarization, UAV remains competitive with strong self-explanation baselines while enabling cross-model verbalization across model families and scales. Ablations show that decoder-side tuning mainly improves task behavior, whereas the adapter provides the activation-grounded factual and semantic information needed for faithful explanations.
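
The adapter idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the single linear projection, the hidden sizes (768, 2048, 1024), the number of soft tokens (4), and the helper names `init_adapter`, `activation_to_soft_tokens`, and `build_decoder_input` are all assumptions chosen for illustration. The sketch only shows how two donors with different hidden sizes could each get their own adapter while feeding soft tokens of one shared width into the same decoder.

```python
import numpy as np


def init_adapter(d_donor, d_decoder, n_soft_tokens, seed=0):
    """Create a hypothetical linear adapter for one donor model."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_donor)
    return {
        # One projection that emits all soft tokens at once.
        "W": rng.normal(0.0, scale, size=(d_donor, n_soft_tokens * d_decoder)),
        "b": np.zeros(n_soft_tokens * d_decoder),
        "n_soft_tokens": n_soft_tokens,
        "d_decoder": d_decoder,
    }


def activation_to_soft_tokens(adapter, activation):
    """Map one donor activation vector to soft tokens in the decoder's space."""
    flat = activation @ adapter["W"] + adapter["b"]
    return flat.reshape(adapter["n_soft_tokens"], adapter["d_decoder"])


def build_decoder_input(soft_tokens, prompt_embeddings):
    """Prepend the soft tokens to the embedded text prompt."""
    return np.concatenate([soft_tokens, prompt_embeddings], axis=0)


# Assumed sizes: two donors with different hidden widths, one shared decoder.
D_DECODER, N_SOFT = 1024, 4
adapter_a = init_adapter(d_donor=768, d_decoder=D_DECODER, n_soft_tokens=N_SOFT, seed=0)
adapter_b = init_adapter(d_donor=2048, d_decoder=D_DECODER, n_soft_tokens=N_SOFT, seed=1)

rng = np.random.default_rng(42)
soft_a = activation_to_soft_tokens(adapter_a, rng.normal(size=768))
soft_b = activation_to_soft_tokens(adapter_b, rng.normal(size=2048))

# Stand-in for the decoder's embedding of a 6-token instruction prompt.
prompt_embeddings = rng.normal(size=(6, D_DECODER))
decoder_input = build_decoder_input(soft_a, prompt_embeddings)

print(soft_a.shape, soft_b.shape, decoder_input.shape)
```

Under these assumptions, adapter-only transfer would mean keeping the decoder and its LoRA weights fixed and updating only the new adapter's `W` and `b` for the additional donor.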

Problem

Research questions and friction points this paper is trying to address.

activation verbalization

cross-model explanation

hidden representations

universal framework

model interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal Activation Verbalizer

cross-model explanation

activation verbalization