Cross-Layer Discrete Concept Discovery for Interpreting Language Models

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Discrete concepts emerging across Transformer layers in large language models are difficult to identify because the residual stream linearly mixes features and duplicates them across layers, making conventional single-layer analysis insufficient for disentanglement. Method: We propose CLVQVAE, the first framework to incorporate vector quantization into cross-layer representation modeling. It employs scaled spherical k-means++ initialization to preserve directional sensitivity, combined with top-k temperature sampling and exponential moving average (EMA) codebook updates, enabling structured disentanglement and semantic aggregation within the residual stream. Results: Experiments show that CLVQVAE reliably extracts compact, semantically coherent, and interpretable concept vectors across multiple layers. It substantially improves the interpretability of internal representations and enables robust cross-layer concept tracking, outperforming prior methods in both fidelity and explainability.
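The top-k temperature sampling mentioned above can be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the function name, the cosine-similarity scoring, and the default k and temperature values are assumptions. The idea is to soften hard nearest-neighbour quantization by sampling among the k most similar codebook entries.

```python
import numpy as np

def topk_temperature_quantize(z, codebook, k=5, temperature=0.5, rng=None):
    """Assign vector z to a code by sampling among its k nearest codes.

    Similarities to the k most similar codebook rows are converted into
    a softmax distribution at the given temperature, and one index is
    sampled from it (hypothetical sketch of the sampling step).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Cosine similarity: direction matters more than magnitude here.
    z_n = z / np.linalg.norm(z)
    cb_n = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sims = cb_n @ z_n                        # (K,) similarity to each code
    topk = np.argsort(sims)[-k:]             # indices of the k best codes
    logits = sims[topk] / temperature
    probs = np.exp(logits - logits.max())    # stable softmax over top-k
    probs /= probs.sum()
    idx = rng.choice(topk, p=probs)
    return idx, codebook[idx]
```

As the temperature approaches zero this reduces to standard argmax quantization; larger temperatures spread probability over the top-k candidates, which is the "controlled exploration" the summary refers to.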

📝 Abstract
Uncovering emergent concepts across transformer layers remains a significant challenge because the residual stream linearly mixes and duplicates information, obscuring how features evolve within large language models. Current research efforts primarily inspect neural representations at single layers, thereby overlooking this cross-layer superposition and the redundancy it introduces. These representations are typically either analyzed directly for activation patterns or passed to probing classifiers that map them to a limited set of predefined concepts. To address these limitations, we propose CLVQVAE, a framework that uses vector quantization to map representations across layers and, in the process, collapse duplicated residual-stream features into compact, interpretable concept vectors. Our approach uniquely combines top-k temperature-based sampling during quantization with EMA codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. We further enhance the framework with scaled-spherical k-means++ for codebook initialization, which clusters by directional similarity rather than magnitude, better aligning with the semantic structure of word embedding space.
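The EMA codebook update the abstract mentions can be sketched with the standard exponential-moving-average variant of VQ-VAE training; the exact decay and smoothing constants used by the paper are not stated here, so the values below are assumptions. Codebook rows move toward the running mean of the encoder vectors assigned to them, with no gradient step on the codebook itself.

```python
import numpy as np

def ema_codebook_update(codebook, cluster_size, embed_sum, z_batch,
                        assignments, decay=0.99, eps=1e-5):
    """One EMA update of a VQ codebook (sketch; constants are assumed).

    cluster_size and embed_sum are running EMA statistics: the (smoothed)
    number of vectors assigned to each code and the sum of those vectors.
    """
    K = codebook.shape[0]
    onehot = np.eye(K)[assignments]          # (B, K) assignment matrix
    batch_size = onehot.sum(axis=0)          # vectors per code this batch
    batch_sum = onehot.T @ z_batch           # (K, D) per-code vector sums
    cluster_size = decay * cluster_size + (1 - decay) * batch_size
    embed_sum = decay * embed_sum + (1 - decay) * batch_sum
    # Laplace smoothing keeps rarely used codes from collapsing to zero.
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n
    new_codebook = embed_sum / smoothed[:, None]
    return new_codebook, cluster_size, embed_sum
```

Because the codebook is updated from assignment statistics rather than gradients, codes that stop being selected simply stop moving, which helps keep the discrete latent space stable across training, consistent with the diversity claim in the abstract.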
Problem

Research questions and friction points this paper is trying to address.

Uncover emergent concepts across transformer layers
Address cross-layer superposition and feature redundancy
Map representations to compact interpretable concept vectors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vector quantization for cross-layer concept mapping
Top-k temperature sampling with EMA updates
Scaled-spherical k-means++ for semantic alignment
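The scaled-spherical k-means++ initialization listed above can be sketched as k-means++ seeding on the unit sphere. This is a hedged reconstruction: the paper's exact scaling scheme is not reproduced here, and the choice of rescaling unit-norm seeds by the mean activation norm is an assumption made for illustration.

```python
import numpy as np

def spherical_kmeanspp_init(points, k, rng=None):
    """k-means++ seeding by direction rather than magnitude (sketch).

    Points are normalized to the unit sphere so seed selection depends
    only on direction; the angular distance 1 - cosine similarity
    replaces squared Euclidean distance in the usual k-means++ weights.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    unit = points / np.linalg.norm(points, axis=1, keepdims=True)
    centers = [unit[rng.integers(len(unit))]]
    for _ in range(k - 1):
        sims = np.stack([unit @ c for c in centers])       # (c, N)
        # Distance to the nearest already-chosen seed, clipped for safety.
        d = np.clip(1.0 - sims.max(axis=0), 0.0, None)
        probs = d / d.sum()
        centers.append(unit[rng.choice(len(unit), p=probs)])
    # Assumed scaling step: restore magnitudes comparable to the
    # residual-stream activations by using the mean data norm.
    scale = np.linalg.norm(points, axis=1).mean()
    return np.stack(centers) * scale
```

Seeding by angular distance favors codebook entries that cover distinct directions in embedding space, which is the "directional similarity rather than magnitude" property the abstract attributes to this initialization.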