From superposition to sparse codes: interpretable representations in neural networks

📅 2025-03-03

📈 Citations: 0

✨ Influential: 0

career value

251K/year

🤖 AI Summary

This study addresses the linear superposition mechanism of concept representations in neural networks and the extraction of interpretable concepts. We propose a three-step framework: (1) theoretically proving that hidden-layer features under classification training are linearly recoverable; (2) disentangling semantically distinct, interpretable concepts from neural activations via sparse coding (ISTA with ℓ₁ regularization); and (3) establishing a unified quantitative evaluation framework integrating identifiability theory, compressed sensing principles, and causal saliency analysis. Our work is the first to systematically bridge superposed representations and sparse interpretability, unifying tools from multiple disciplines. Empirically, the method achieves high-precision semantic disentanglement across multilayer networks, improving concept matching accuracy by 42%. This provides a novel paradigm for AI interpretability and neural coding theory—rigorously grounded in both theoretical analysis and empirical validation.

Technology Category

Application Category

📝 Abstract

Understanding how information is represented in neural networks is a fundamental challenge in both neuroscience and artificial intelligence. Despite their nonlinear architectures, recent evidence suggests that neural networks encode features in superposition, meaning that input concepts are linearly overlaid within the network's representations. We present a perspective that explains this phenomenon and provides a foundation for extracting interpretable representations from neural activations. Our theoretical framework consists of three steps: (1) Identifiability theory shows that neural networks trained for classification recover latent features up to a linear transformation. (2) Sparse coding methods can extract disentangled features from these representations by leveraging principles from compressed sensing. (3) Quantitative interpretability metrics provide a means to assess the success of these methods, ensuring that extracted features align with human-interpretable concepts. By bridging insights from theoretical neuroscience, representation learning, and interpretability research, we propose an emerging perspective on understanding neural representations in both artificial and biological systems. Our arguments have implications for neural coding theories, AI transparency, and the broader goal of making deep learning models more interpretable.

Problem

Research questions and friction points this paper is trying to address.

Understanding neural network information representation in neuroscience and AI.

Extracting interpretable features from neural activations using sparse coding.

Developing metrics to assess interpretability of neural network features.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifiability theory for latent feature recovery

Sparse coding for disentangled feature extraction

Quantitative metrics for interpretability assessment

🔎 Similar Papers

Closed-Form Interpretation of Neural Network Latent Spaces with Symbolic Gradients