A Geometric Unification of Concept Learning with Concept Cones

📅 2025-12-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of evaluating geometric and semantic alignment between concepts learned by sparse autoencoders (SAEs) and human-defined concepts (e.g., in concept bottleneck models, CBMs). We propose the “concept cone” framework, a unified formalism that models interpretable concepts as nonnegative linear combinations of directions in activation space, so that alignment can be quantified via cone containment. This geometric formulation unifies the supervised and unsupervised concept-learning paradigms, connecting CBMs’ human-annotated priors with SAEs’ sparsity-inducing inductive bias, and yields quantitative metrics linking architectural inductive biases to the emergence of plausible concepts. Empirically, we identify a sweet spot in sparsity and expansion factor that substantially improves SAEs’ alignment with human concept systems, both in geometric structure (e.g., directional consistency) and in semantic coherence (e.g., concept fidelity and composability).
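
To make the cone formalism concrete, here is a minimal formal sketch of the two objects involved; the notation is ours and only illustrates the idea of a concept cone and cone containment, not the paper's exact definitions.

```latex
% Concept cone generated by directions d_1, ..., d_k in activation space R^n
% (the d_i could be CBM concept vectors or SAE decoder columns):
\[
  \mathcal{C}(D) = \Big\{ \sum_{i=1}^{k} \alpha_i \, d_i \;:\; \alpha_i \ge 0 \Big\},
  \qquad D = \{d_1, \dots, d_k\} \subset \mathbb{R}^n .
\]
% Containment of the CBM cone in the SAE cone reduces to checking the generators:
% every CBM direction must be a nonnegative combination of SAE directions
% (stacking the SAE directions as the columns of a matrix D_SAE).
\[
  \mathcal{C}(D_{\mathrm{CBM}}) \subseteq \mathcal{C}(D_{\mathrm{SAE}})
  \;\iff\;
  \forall\, d \in D_{\mathrm{CBM}} \;\; \exists\, \alpha \ge 0 :\; d = D_{\mathrm{SAE}}\,\alpha .
\]
```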

📝 Abstract
Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases -- such as SAE type, sparsity, or expansion ratio -- to the emergence of plausible concepts. (We adopt the terminology of Jacovi and Goldberg (2020), who distinguish between faithful explanations, which accurately reflect model computations, and plausible explanations, which align with human intuition and domain knowledge. CBM concepts are plausible by construction -- selected or annotated by humans -- though not necessarily faithful to the true latent factors that organise the data manifold.) Using these metrics, we uncover a "sweet spot" in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concepts align with plausible human concepts.
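
As a rough illustration of how the containment idea can be turned into a number, the sketch below scores each CBM direction by how well it is reconstructed as a nonnegative combination of SAE decoder directions (via nonnegative least squares). The function name, the residual-based score, and the synthetic example data are our own assumptions, not the paper's metric.

```python
import numpy as np
from scipy.optimize import nnls  # nonnegative least squares solver

def containment_score(cbm_dirs: np.ndarray, sae_dirs: np.ndarray) -> float:
    """Proxy for how well the SAE cone contains the CBM cone.

    cbm_dirs: (k_cbm, n) human-defined concept directions (e.g. CBM weights).
    sae_dirs: (k_sae, n) learned directions (e.g. SAE decoder rows).
    Returns a value in [0, 1]; 1 means every CBM direction is (numerically)
    a nonnegative combination of SAE directions, i.e. lies inside the SAE cone.
    """
    A = sae_dirs.T  # (n, k_sae): columns are the SAE directions
    scores = []
    for d in cbm_dirs:
        d = d / (np.linalg.norm(d) + 1e-12)      # compare directions, not scales
        _, residual = nnls(A, d)                 # min_{alpha >= 0} ||A @ alpha - d||
        scores.append(1.0 - min(residual, 1.0))  # small residual -> well contained
    return float(np.mean(scores))

# Purely illustrative example with synthetic data:
rng = np.random.default_rng(0)
sae = rng.normal(size=(64, 512))                      # 64 SAE directions in a 512-d space
cbm = np.maximum(rng.normal(size=(8, 64)), 0) @ sae   # 8 CBM dirs built inside the SAE cone
print(containment_score(cbm, sae))                    # close to 1.0 by construction
```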
Problem

Research questions and friction points this paper is trying to address.

CBMs and SAEs have developed in parallel, with no shared formalism for comparing the concepts they prescribe or discover.
How can concepts discovered by SAEs be evaluated against human-defined CBM concepts in a principled, geometric way?
How do SAE design choices such as architecture type, sparsity, and expansion factor affect the emergence of plausible human concepts?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies concept learning via geometric concept cones in activation space
Bridges supervised CBMs and unsupervised SAEs through containment metrics
Identifies a sparsity and expansion-factor sweet spot that maximizes geometric and semantic alignment
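
A hypothetical usage sketch for the kind of hyperparameter sweep these bullets describe: `train_sae`, the hyperparameter grid, and the use of `containment_score` (from the sketch after the abstract) are placeholders, not the paper's actual experimental setup.

```python
# Hypothetical sweep over SAE sparsity (L1 coefficient) and expansion factor,
# scoring each trained SAE with the containment proxy sketched after the abstract.
# `train_sae` is a placeholder that returns the learned decoder directions.
import itertools

def find_sweet_spot(activations, cbm_dirs, train_sae):
    results = {}
    for l1_coeff, expansion in itertools.product([1e-4, 1e-3, 1e-2], [2, 4, 8, 16]):
        sae_dirs = train_sae(activations, l1_coeff=l1_coeff, expansion=expansion)
        results[(l1_coeff, expansion)] = containment_score(cbm_dirs, sae_dirs)
    best = max(results, key=results.get)  # the sparsity/expansion "sweet spot"
    return best, results
```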