AlignSAE: Concept-Aligned Sparse Autoencoders

📅 2025-12-01

📈 Citations: 0

✨ Influential: 0

career value

191K/year

🤖 AI Summary

Large language models (LLMs) encode implicit knowledge in dense, high-dimensional parameter spaces that are inherently opaque and difficult to interpret or intervene upon. While sparse autoencoders (SAEs) can extract fine-grained features, they lack reliable alignment with human-interpretable concept ontologies, resulting in semantic entanglement. Method: We propose a “pretraining + post-training” curriculum learning framework: first, unsupervised pretraining to induce sparse representations; then, supervised post-training that explicitly binds SAE latent units to structured concept ontologies, achieving semantic disentanglement. Contribution/Results: This is the first method to establish verifiable, intervention-friendly concept-aligned sparse coding within LLM hidden layers. Experiments demonstrate that manipulating a single aligned latent variable enables high-fidelity, causally robust semantic edits—such as entity or attribute replacement—without compromising output quality. The approach significantly enhances both the interpretability and controllability of LLM knowledge.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable features, they often struggle to reliably align these features with human-defined concepts, resulting in entangled and distributed feature representations. To address this, we introduce AlignSAE, a method that aligns SAE features with a defined ontology through a "pre-train, then post-train" curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface where specific relations can be inspected and controlled without interference from unrelated features. Empirical results demonstrate that AlignSAE enables precise causal interventions, such as reliable "concept swaps", by targeting single, semantically aligned slots.

Problem

Research questions and friction points this paper is trying to address.

Aligns SAE features with human-defined concepts

Reduces entanglement in feature representations

Enables precise causal interventions in LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns SAE features with ontology via curriculum training

Uses supervised post-training to bind concepts to latent slots

Enables precise causal interventions through semantically aligned slots

🔎 Similar Papers

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models