Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Understanding how large language models (LLMs) organize semantic knowledge and exert causal control over outputs remains challenging due to their opacity and distributed representations. Method: We propose a sparse autoencoder (SAE)-based feature coactivation analysis to identify semantically coherent, context-stable neural modules, and combine causal attribution, amplification, and ablation experiments to validate their functional roles across layers. Contribution/Results: We establish a hierarchical modular architecture in which early layers encode concrete concepts (e.g., “France”) and deeper layers encode abstract relations (e.g., “capital-of”). We demonstrate predictable, cross-layer compositional interventions (e.g., swapping country–relation pairs) that generate controllable counterfactual outputs. Our approach provides an interpretable, operationally grounded paradigm for modeling internal knowledge organization and enabling causal, fine-grained output steering in LLMs.
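The coactivation step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: it assumes SAE feature activations have already been collected into a prompts × features matrix, and the threshold values and the component-extraction heuristic (connected components of a thresholded coactivation graph) are assumptions for the sake of the sketch.

```python
import numpy as np

def coactivation_components(feature_acts, act_thresh=0.0, co_thresh=0.8):
    """Group SAE features that fire together across a handful of prompts.

    feature_acts: (n_prompts, n_features) array of SAE feature activations.
    Returns lists of feature indices, one list per coactivation component.
    """
    active = feature_acts > act_thresh                     # binary firing pattern
    counts = active.sum(axis=0)                            # prompts on which each feature fires
    # joint firing counts for every feature pair
    joint = active.T.astype(float) @ active.astype(float)  # (n_features, n_features)
    # normalize by the rarer feature so co == 1.0 means "always fires together"
    denom = np.minimum.outer(counts, counts).clip(min=1)
    co = joint / denom
    adj = (co >= co_thresh) & (counts > 0)[:, None] & (counts > 0)[None, :]

    # union-find over the coactivation graph to extract connected components
    parent = list(range(adj.shape[0]))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(adj.shape[0]):
        for j in range(i + 1, adj.shape[0]):
            if adj[i, j]:
                parent[find(i)] = find(j)

    comps = {}
    for i in range(adj.shape[0]):
        if counts[i] > 0:
            comps.setdefault(find(i), []).append(i)
    return list(comps.values())
```

With three prompts where features 0 and 1 always fire together and feature 2 fires alone, the function returns two components, `[0, 1]` and `[2]`.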

📝 Abstract
We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts. Focusing on country-relation tasks, we show that ablating semantic components for countries and relations changes model outputs in predictable ways, while amplifying these components induces counterfactual responses. Notably, composing relation and country components yields compound counterfactual outputs. We find that, whereas most country components emerge from the very first layer, the more abstract relation components are concentrated in later layers. Furthermore, within relation components themselves, nodes from later layers tend to have a stronger causal impact on model outputs. Overall, these findings suggest a modular organization of knowledge within LLMs and advance methods for efficient, targeted model manipulation.
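The ablation and amplification interventions described in the abstract can be sketched as edits to a layer's SAE feature activations before they are decoded back into the residual stream. The function name, array shapes, and the default amplification scale below are illustrative assumptions; the paper's actual interventions run inside the model's forward pass.

```python
import numpy as np

def intervene(sae_feats, component, mode="ablate", scale=5.0):
    """Edit SAE feature activations for one semantic component.

    sae_feats: (seq_len, n_features) feature activations at one layer.
    component: list of feature indices forming the semantic module.
    mode: "ablate" zeros the component (e.g., removes "France");
          "amplify" multiplies it by `scale` to induce counterfactual outputs.
    The edited features would then be decoded back into the residual stream.
    """
    out = sae_feats.copy()
    if mode == "ablate":
        out[:, component] = 0.0
    elif mode == "amplify":
        out[:, component] *= scale
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out
```

Ablation predicts degraded or redirected answers on the corresponding country–relation prompts, while amplification steers the model toward the component's concept even when the prompt does not mention it.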
Problem

Research questions and friction points this paper is trying to address.

How to identify semantically coherent, context-consistent modules inside opaque, distributed LLM representations
Whether such modules can be ablated, amplified, and composed to change model outputs predictably
How concrete concepts (countries) and abstract relations are distributed across layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses coactivation of sparse autoencoder features, collected from just a handful of prompts, to find semantic components
Ablates, amplifies, and composes components to change outputs predictably
Reveals a modular, layer-stratified organization of knowledge in LLMs
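The compositional intervention (swapping a country while keeping the relation) can be sketched as a combined edit: ablate the original concept's component and amplify a substitute's. The component indices and scale are hypothetical; in the paper, such swaps yield compound counterfactual outputs.

```python
import numpy as np

def compose_swap(sae_feats, src_component, dst_component, scale=5.0):
    """Counterfactual composition: suppress one module, boost another.

    E.g., ablate a "France" component and amplify a "Japan" component so
    a "capital-of" prompt is steered toward the substituted country.
    Indices and the scale factor here are illustrative, not the paper's.
    """
    out = sae_feats.copy()
    out[:, src_component] = 0.0      # ablate the original concept
    out[:, dst_component] *= scale   # amplify the substitute concept
    return out
```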