Towards Explainability of SLMs by investigating Token Level Activation

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing attention-based interpretability methods often overemphasize structurally salient but semantically weak tokens, such as punctuation, and so fail to reveal the semantic mechanisms of language models. To address this, the work proposes the Activation Flow Network (AFN), a lightweight, model-agnostic interpretability framework that uses the L2 norm of hidden states from Layer 8 of BERT as a measure of semantic salience. Tokens are partitioned into HIGH- and LOW-activation groups using an empirical upper-quartile threshold. Experimental observations indicate that semantically rich tokens consistently fall in the HIGH-activation group and dominate representational activation shifts, suggesting that Layer 8 is a critical region where structural and semantic information converge. The approach offers a computationally efficient alternative to attention-centric analysis.

📝 Abstract

Transformer-based language models such as BERT, with 110M+ parameters, have revolutionized natural language understanding, yet their internal mechanisms remain largely opaque to researchers and practitioners. Traditional attention-based interpretability methods often emphasize structurally important but semantically weak tokens, such as punctuation marks, rather than meaningful semantic relationships. This work introduces a lightweight and model-agnostic framework for quantifying token-level representational importance using hidden-state activation strengths at Layer 8 of BERT. The proposed Activation Flow Network (AFN) framework computes Token Activation Strength as the L2 norm of Layer-8 hidden representations, enabling direct ranking of semantically salient tokens. The study further introduces a threshold-based activation bucket formulation that partitions tokens into HIGH-activation and LOW-activation groups using an empirical upper-quartile activation boundary. Experimental observations demonstrate that semantically meaningful content words consistently occupy the HIGH-activation bucket and dominate representational activation shifts, while structurally supportive tokens contribute comparatively less. The results suggest that Layer 8 acts as a critical semantic consolidation zone balancing structural and semantic information processing. By revealing how activation magnitudes concentrate around semantically informative tokens, this work provides an interpretable and computationally efficient alternative to attention-centric analysis, contributing toward transforming BERT from a "black box" into a more transparent "glass box" model for natural language understanding.
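
The two steps the abstract describes, scoring each token by the L2 norm of its hidden-state vector and then splitting tokens at an upper-quartile boundary, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the linear-interpolation quantile, the `>=` comparison at the boundary, and the toy 2-dimensional vectors are all assumptions made for this example. Real Layer-8 BERT hidden states would be 768-dimensional and would come from the model itself.

```python
import math
from typing import Dict, List, Sequence, Tuple


def token_activation_strength(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Return the L2 norm of each token's hidden-state vector."""
    return [math.sqrt(sum(x * x for x in vec)) for vec in hidden_states]


def upper_quartile(values: Sequence[float]) -> float:
    """75th percentile using linear interpolation (an assumed convention)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    pos = 0.75 * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def bucket_tokens(
    tokens: Sequence[str], hidden_states: Sequence[Sequence[float]]
) -> Dict[str, List[Tuple[str, float]]]:
    """Split tokens into HIGH and LOW buckets at the upper-quartile strength."""
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must have the same length")
    strengths = token_activation_strength(hidden_states)
    threshold = upper_quartile(strengths)
    buckets: Dict[str, List[Tuple[str, float]]] = {"HIGH": [], "LOW": []}
    for token, strength in zip(tokens, strengths):
        # Assumption: a token exactly on the boundary counts as HIGH.
        buckets["HIGH" if strength >= threshold else "LOW"].append((token, strength))
    return buckets


# Toy, hand-made vectors purely for illustration; NOT real BERT outputs.
toy_tokens = ["the", "river", "flooded", "."]
toy_hidden = [[1.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]]
toy_buckets = bucket_tokens(toy_tokens, toy_hidden)
```

With the Hugging Face `transformers` library, per-layer hidden states can be requested by passing `output_hidden_states=True` to the model call. The returned `hidden_states` tuple starts with the embedding output at index 0, so index 8 would correspond to the output of the eighth encoder layer. Whether the paper extracts its Layer-8 representations this way is not stated in the abstract.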

Problem

Research questions and friction points this paper is trying to address.

Explainability

Token-level Activation

Semantic Interpretability

Transformer Models

Black-box Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level Activation

Activation Flow Network

Semantic Interpretability