🤖 AI Summary
Concept vectors suffer from limited interpretability because concept-relevant and concept-irrelevant tokens produce heavily overlapping activations. We observe that, despite substantial overlap between the in-concept and out-of-concept activation distributions, the extreme upper tail of the in-concept distribution consistently encodes a stable, discriminative semantic signal. To exploit this, we propose SuperActivator, a robust concept detection mechanism grounded in extreme value analysis that identifies tail-activated tokens across modalities (image and text), architectures, and network layers. Used for concept detection, SuperActivator improves F1 scores by up to 14% across multiple models and tasks, and its tail-activated tokens further enhance the reliability of concept-level feature attributions. Our core contribution is the systematic identification and use of the discriminative semantic information residing in the activation distribution's tail, establishing a new approach to concept-level interpretability in deep neural networks.
📝 Abstract
Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their utility is often limited by noisy and inconsistent activations. In this work, we uncover a clear pattern within the noise, which we term the SuperActivator Mechanism: while in-concept and out-of-concept activations overlap considerably, the token activations in the extreme high tail of the in-concept distribution provide a reliable signal of concept presence. We demonstrate the generality of this mechanism by showing that SuperActivator tokens consistently outperform standard vector-based and prompting concept detection approaches, achieving up to a 14% higher F1 score across image and text modalities, model architectures, model layers, and concept extraction techniques. Finally, we leverage SuperActivator tokens to improve feature attributions for concepts.
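The mechanism described above can be illustrated with a minimal sketch: score each token by its activation along a concept vector, then flag only the tokens in the extreme upper tail of that activation distribution as evidence of concept presence. The function name, the dot-product activation, and the percentile cutoff below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def superactivator_detect(token_embeddings, concept_vector, tail_percentile=99.0):
    """Flag tokens whose concept-vector activation lies in the extreme upper tail.

    Hypothetical sketch: the dot-product activation and the percentile
    threshold are assumptions for illustration only.
    """
    # Activation of each token along the concept direction.
    acts = token_embeddings @ concept_vector
    # Upper-tail cutoff over the observed activation distribution.
    threshold = np.percentile(acts, tail_percentile)
    # Only tail-activated tokens count as concept evidence.
    return acts >= threshold

# Toy usage: 100 tokens in a 16-d space, with a few tokens deliberately
# pushed along a synthetic "concept" direction.
rng = np.random.default_rng(0)
concept = rng.normal(size=16)
tokens = rng.normal(size=(100, 16))
tokens[:3] += 5 * concept  # tokens 0-2 strongly express the concept
mask = superactivator_detect(tokens, concept)
```

Even though many in-concept and out-of-concept tokens would receive overlapping mid-range activations, the tail cutoff isolates only the strongest activators, which in this toy setup fall among the deliberately concept-aligned tokens.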