Superposition disentanglement of neural representations reveals hidden alignment

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Superposition, where multiple features are encoded as heterogeneous linear combinations across neurons, distorts cross-model and cross-species representational alignment metrics, causing conventional mapping methods (e.g., linear regression, soft-matching) to underestimate true alignment. Method: The authors identify superposition-induced structural mismatch as a key mechanism behind degraded alignment scores and propose the first application of sparse autoencoders (SAEs) to disentangle superposed representations, recovering latent alignment obscured by overlapping feature encodings. Contribution/Results: Evaluated on synthetic toy models, vision DNNs, and neural recordings, SAE-based disentanglement consistently yields significantly higher alignment scores, revealing more accurate cross-system representational correspondences. This work establishes a theoretical framework for analyzing representational alignment under superposition and provides a scalable, principled methodology that improves the reliability of model-to-brain and inter-model alignment studies.
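
To make the metric comparison concrete, here is a minimal sketch (hypothetical variable names and data, not the authors' code) of linear-regression alignment computed on raw activations versus SAE latent codes:

```python
# Cross-validated ridge-regression alignment between two response matrices.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

def linear_alignment(X, Y, alpha=1.0, cv=5):
    """Mean cross-validated R^2 of a ridge map X -> Y.

    X, Y: (n_stimuli, n_units) responses of two systems to the same stimuli.
    """
    Y_hat = cross_val_predict(Ridge(alpha=alpha), X, Y, cv=cv)
    return r2_score(Y, Y_hat, multioutput="uniform_average")

# X_a, X_b: raw unit activations of two systems on shared stimuli;
# Z_a, Z_b: the corresponding SAE latent codes (all hypothetical arrays).
# The paper's claim predicts linear_alignment(Z_a, Z_b) > linear_alignment(X_a, X_b).
```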

📝 Abstract
The superposition hypothesis states that a single neuron within a population may participate in the representation of multiple features in order for the population to represent more features than the number of neurons. In neuroscience and AI, representational alignment metrics measure the extent to which different deep neural networks (DNNs) or brains represent similar information. In this work, we explore a critical question: does superposition interact with alignment metrics in any undesirable way? We hypothesize that models which represent the same features in different superposition arrangements, i.e., their neurons have different linear combinations of the features, will interfere with predictive mapping metrics (semi-matching, soft-matching, linear regression), producing lower alignment than expected. We first develop a theory for how the strict permutation metrics are dependent on superposition arrangements. This is tested by training sparse autoencoders (SAEs) to disentangle superposition in toy models, where alignment scores are shown to typically increase when a model's base neurons are replaced with its sparse overcomplete latent codes. We find similar increases for DNN→DNN and DNN→brain linear regression alignment in the visual domain. Our results suggest that superposition disentanglement is necessary for mapping metrics to uncover the true representational alignment between neural codes.
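
As one concrete reading of the permutation metrics named in the abstract, the sketch below scores alignment by finding the one-to-one neuron matching that maximizes total pairwise correlation (an illustrative construction via the Hungarian algorithm, not the paper's released code):

```python
# Strict permutation alignment: best one-to-one neuron matching by correlation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def permutation_alignment(X, Y):
    """X, Y: (n_stimuli, n_neurons) responses to the same stimuli."""
    Xc = (X - X.mean(0)) / (X.std(0) + 1e-8)   # z-score each neuron
    Yc = (Y - Y.mean(0)) / (Y.std(0) + 1e-8)
    C = (Xc.T @ Yc) / X.shape[0]               # neuron-pair correlation matrix
    rows, cols = linear_sum_assignment(-C)     # permutation maximizing total correlation
    return C[rows, cols].mean()                # mean matched correlation, in [-1, 1]
```

If two models hold the same features in different linear mixtures, no permutation of raw neurons lines them up, so this score stays low until the codes are disentangled.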
Problem

Research questions and friction points this paper is trying to address.

Investigating superposition's interference with neural alignment metrics
Testing whether different superposition arrangements reduce measured alignment (see the toy construction sketched after this list)
Developing disentanglement methods to reveal true representational similarity
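
The following toy construction (an illustrative assumption mirroring the setup in spirit, not code from the paper) shows how two models can encode identical features in different superposition arrangements:

```python
# Two "models" sharing sparse features but mixing them differently into neurons.
import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_features, n_neurons = 1000, 64, 16   # more features than neurons

# Sparse nonnegative features: each is active on ~5% of stimuli.
F = rng.random((n_stimuli, n_features)) * (rng.random((n_stimuli, n_features)) < 0.05)

W_a = rng.standard_normal((n_features, n_neurons))  # model A's superposition arrangement
W_b = rng.standard_normal((n_features, n_neurons))  # model B's superposition arrangement

X_a, X_b = F @ W_a, F @ W_b  # identical feature content, different neuron-level codes
```

Neuron-matching metrics applied to X_a versus X_b score low here even though the underlying features are identical, which is exactly the failure mode the paper targets.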
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using sparse autoencoders to disentangle superposition (sketched after this list)
Replacing base neurons with sparse latent codes
Increasing alignment scores via superposition disentanglement
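
A minimal sketch of such a sparse autoencoder, assuming a standard ReLU encoder with an L1 sparsity penalty rather than the paper's exact architecture or hyperparameters:

```python
# Overcomplete sparse autoencoder over a layer's activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons, expansion=8):
        super().__init__()
        n_latents = n_neurons * expansion            # overcomplete dictionary
        self.encoder = nn.Linear(n_neurons, n_latents)
        self.decoder = nn.Linear(n_latents, n_neurons)

    def forward(self, x):
        z = torch.relu(self.encoder(x))              # nonnegative, sparse latent code
        return self.decoder(z), z

def train_sae(acts, l1=1e-3, epochs=100, lr=1e-3):
    """acts: (n_samples, n_neurons) tensor of recorded activations."""
    sae = SparseAutoencoder(acts.shape[1])
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, z = sae(acts)
        loss = ((recon - acts) ** 2).mean() + l1 * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

# Alignment metrics are then recomputed on z (the latent codes) in place of
# the raw neurons, which is the substitution described above.
```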