🤖 AI Summary
This study addresses a critical confound in interpreting superposition in neural representations: existing metrics misattribute activation overlap between homonyms (such as "bank," denoting either a financial institution or a riverside) to genuine concept compression, conflating lexical form with semantic distinction. Through a systematic 2×2 factorial design that disentangles lexical identity from semantic content, the work demonstrates that polysemy, rather than true superposition, is the primary driver of observed activation overlap. Experiments across models ranging from 110M to 70B parameters reveal that 18–36% of sparse autoencoder features conflate distinct word senses, predominantly within the top ≤1% most active dimensions. Removing this confound significantly improves word-sense disambiguation performance and enhances the selectivity of knowledge-editing interventions (p = 0.002).
📝 Abstract
If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition: the neuron must be compressing two unrelated concepts. This work asks how much of that overlap is instead due to a lexical confound: neurons firing for a shared word form (such as "bank") rather than for two compressed concepts. A 2×2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M–70B parameters. The confound carries into sparse autoencoders (18–36% of features blend senses), concentrates in ≤1% of activation dimensions, and hurts downstream tasks: filtering it out improves word sense disambiguation and makes knowledge edits more selective (p = 0.002).
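To make the 2×2 design concrete, the sketch below contrasts the two off-diagonal cells with toy activation vectors. All names and vectors here are hypothetical stand-ins (the paper's actual extraction pipeline and models are not shown); a real study would pull hidden states from controlled sentence contexts and compare overlap across the four cells.

```python
# Illustrative sketch of the 2x2 factorial decomposition (toy data, not
# the paper's pipeline). The two contrasted cells:
#   lexical-only : same word, different sense  ("bank" money vs. "bank" river)
#   semantic-only: different word, same sense  ("lender" vs. "creditor")
import numpy as np

def overlap(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two activation vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Hypothetical stand-ins for model hidden states. The shared `form`
# component mimics a word-form-driven signal in the representation.
form = rng.normal(size=64)
bank_money = form + 0.5 * rng.normal(size=64)    # word "bank", finance sense
bank_river = form + 0.5 * rng.normal(size=64)    # word "bank", river sense
lender     = rng.normal(size=64)                 # distinct word, shared sense
creditor   = lender + 0.5 * rng.normal(size=64)  # distinct word, shared sense

lexical_only  = overlap(bank_money, bank_river)  # same word, different sense
semantic_only = overlap(lender, creditor)        # different word, same sense

# The paper's finding, in these terms: lexical_only consistently exceeds
# semantic_only, so raw overlap tracks word form, not concept compression.
print(f"lexical-only overlap:  {lexical_only:.2f}")
print(f"semantic-only overlap: {semantic_only:.2f}")
```

The point of the construction is that a superposition metric based on raw overlap cannot distinguish the two cells; only the factorial contrast separates word-form effects from genuine concept sharing.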