🤖 AI Summary
How language models internally represent and retrieve factual knowledge about entities remains poorly understood. This work localizes entity-selective MLP neurons with templated prompts and validates their functional role across multiple language models using causal interventions, namely negative ablation and controlled injection. The study finds sparse, causally manipulable "entity units": individual neurons, concentrated in early layers, that support compact entity retrieval and canonical resolution across aliases, acronyms, misspellings, and languages. On a curated set of 200 PopQA entities, activating a single such neuron is sufficient for many entities to recover entity-consistent predictions, and injection improves answer retrieval relative to mean-entity and wrong-cell controls. The effect is not universal, and coverage is higher for popular entities.
📝 Abstract
Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.
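The localize-then-intervene workflow described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The selectivity score (a z-score of the target entity's mean activation against all other entities), the reading of "negative ablation" as driving a unit to the negative of its magnitude, and every function name here are assumptions made for illustration. In a real model, the activation arrays would come from hooks on MLP layers.

```python
import numpy as np


def selectivity_scores(acts, entity_ids, target):
    """Score each neuron by how strongly it prefers `target` (assumed metric).

    acts: (n_prompts, n_neurons) MLP activations from templated prompts.
    entity_ids: (n_prompts,) entity label for each prompt.
    """
    acts = np.asarray(acts, dtype=float)
    is_target = np.asarray(entity_ids) == target
    other = acts[~is_target]
    # z-score of the target mean against the distribution over other entities
    return (acts[is_target].mean(axis=0) - other.mean(axis=0)) / (other.std(axis=0) + 1e-8)


def top_neuron(acts, entity_ids, target):
    """Index of the most entity-selective neuron under the score above."""
    return int(np.argmax(selectivity_scores(acts, entity_ids, target)))


def negative_ablate(hidden, neuron, scale=1.0):
    """Return a copy with one unit pushed to -scale * |activation| (assumed definition)."""
    out = np.array(hidden, dtype=float, copy=True)
    out[..., neuron] = -scale * np.abs(out[..., neuron])
    return out


def inject(hidden, neuron, value):
    """Return a copy with one unit set to `value`, e.g. at a placeholder token."""
    out = np.array(hidden, dtype=float, copy=True)
    out[..., neuron] = value
    return out


# Synthetic demo: entity 0 strongly drives neuron 7; everything else is noise.
rng = np.random.default_rng(0)
entity_ids = np.repeat(np.arange(3), 20)
acts = rng.normal(size=(60, 16))
acts[entity_ids == 0, 7] += 10.0
unit = top_neuron(acts, entity_ids, target=0)
ablated = negative_ablate(acts[0], unit)
```

A mean-entity or wrong-cell control, as named in the abstract, would reuse `inject` with a value averaged over entities or with a neuron index localized for a different entity.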