🤖 AI Summary
This study investigates how sparse coding organizes the representational structure of language model activation vectors and uncovers its intrinsic links to feature disentanglement and reconstruction fidelity.
Method: We propose SAEMA to empirically validate the representational hierarchy; formally define local and global representations; establish a causal relationship between their separability and reconstruction quality; and reinterpret sparsity principles from a geometric perspective. Technically, we integrate rank analysis of symmetric positive semi-definite matrices, modal tensor decomposition, noise-robustness evaluation, optimization-driven representation intervention, and joint modeling of sparse coding and feature merging.
Contributions/Results: Empirical results demonstrate that sparse coding not only enhances feature discriminability but also introduces orthogonal redundant dimensions; crucially, representation separability, rather than sparsity alone, is the decisive factor governing reconstruction performance. These findings provide both theoretical foundations and empirical evidence for representation disentanglement and tool design in interpretable AI.
📝 Abstract
Sparse Autoencoders (SAEs) have emerged as a predominant tool in mechanistic interpretability, aiming to identify interpretable monosemantic features. However, how does sparse encoding organize the representations of activation vectors from language models? And how does this organizational paradigm relate to feature disentanglement and reconstruction performance? To address these questions, we propose SAEMA, which validates the stratified structure of representations by observing how the rank of the symmetric positive semi-definite (SSPD) matrix, derived from the modal unfolding of the latent tensor, varies with the level of noise added to the residual stream. To systematically investigate how sparse encoding alters representational structure, we define local and global representations and demonstrate that sparse encoding amplifies inter-feature distinctions by merging similar semantic features and introducing additional dimensionality. Furthermore, we intervene on the global representations from an optimization perspective, establishing a significant causal relationship between their separability and reconstruction performance. This study explains the principles of sparsity from the perspective of representational geometry, demonstrates how changes in representational structure affect reconstruction performance, and emphasizes the necessity of understanding representations and incorporating representational constraints, providing empirical references for developing new interpretability tools and improving SAEs. The code is available at https://github.com/wenjie1835/SAERepGeo.
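To make the core diagnostic concrete, the sketch below illustrates the kind of rank-under-noise analysis the abstract describes: unfold a latent tensor along its hidden mode, form the SSPD (Gram) matrix, and track its numerical rank as Gaussian noise of increasing magnitude is injected. This is a minimal illustration with a synthetic low-rank tensor standing in for residual-stream activations; the tensor shapes, the rank tolerance, and the noise levels are all assumptions for demonstration, not the paper's actual experimental setup.

```python
import numpy as np

def numerical_rank(mat, tol=1e-6):
    """Numerical rank: count singular values above a tolerance
    relative to the largest singular value."""
    s = np.linalg.svd(mat, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)

# Synthetic stand-in for a latent activation tensor
# (batch x tokens x hidden). Only the first 8 hidden
# directions carry signal, so the intrinsic rank is low.
scale = np.concatenate([np.ones(8), 1e-8 * np.ones(56)])
latent = rng.normal(size=(32, 16, 64)) * scale

ranks = []
for noise_level in [0.0, 1e-4, 1e-2]:
    noisy = latent + noise_level * rng.normal(size=latent.shape)
    # Unfold along the hidden mode: flatten batch and token axes.
    unfolded = noisy.reshape(-1, noisy.shape[-1])
    # SSPD matrix of the unfolding (Gram matrix over hidden dims).
    sspd = unfolded.T @ unfolded
    ranks.append(numerical_rank(sspd))

print(ranks)  # rank stays low until noise dominates the dead dims
```

Watching where the rank jumps as the noise level sweeps upward separates directions that encode signal from redundant dimensions that only noise can excite, which is the geometric distinction the abstract draws between meaningful and orthogonal redundant structure.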