🤖 AI Summary
This work addresses the interpretability and controllability of neural representations by proposing a geometric concept-anchoring method under extremely low supervision, requiring labels for fewer than 0.1% of samples per anchored concept. The approach combines activation normalization, a concept-separation regularizer, and anchor/subspace attraction regularizers to embed target concepts along predefined directions or axis-aligned subspaces in the latent space, while letting all other concepts self-organize in the orthogonal complement. The anchored geometry supports two interventions: reversible behavioral steering and permanent deletion of anchored concepts. Experiments on structured autoencoders demonstrate selective attenuation or complete removal of target concepts with negligible impact on orthogonal features, at reconstruction error approaching the information-theoretic lower bound.
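The summary only names the training terms, so the sketch below shows one way such an objective could be assembled in PyTorch. The helper `anchoring_losses`, the exact functional forms of the regularizers, and the weights `lambda_sep`/`lambda_anchor` are illustrative assumptions under the description above, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def anchoring_losses(z, labels, anchors, lambda_sep=0.1, lambda_anchor=1.0):
    """Hypothetical sketch of the three training terms named in the summary.

    z       : (B, D) batch of latent activations
    labels  : (B,) concept ids; -1 for the unlabeled majority
    anchors : (K, D) predefined unit directions, one per anchored concept
    """
    # Activation normalization: place latents on the unit hypersphere.
    z = F.normalize(z, dim=-1)

    # Separation regularizer (assumed form): push latents toward mutual
    # orthogonality by penalizing off-diagonal cosine similarity.
    sim = z @ z.T
    off_diag = sim - torch.diag(torch.diag(sim))
    sep_loss = off_diag.pow(2).mean()

    # Anchor attraction: pull the rare labeled examples toward their
    # concept's predefined direction (cosine-alignment loss).
    mask = labels >= 0
    if mask.any():
        target = anchors[labels[mask]]
        anchor_loss = (1.0 - (z[mask] * target).sum(-1)).mean()
    else:
        anchor_loss = z.new_zeros(())

    return lambda_sep * sep_loss + lambda_anchor * anchor_loss
```

Because only a tiny labeled subset carries the anchor term, most batches contribute purely through normalization and separation, which is consistent with the "<0.1% of samples" supervision budget.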
📝 Abstract
We introduce Sparse Concept Anchoring, a method that biases the latent space to position a targeted subset of concepts while allowing others to self-organize, using only minimal supervision (labels for <0.1% of examples per anchored concept). Training combines activation normalization, a separation regularizer, and anchor or subspace regularizers that attract rare labeled examples to predefined directions or axis-aligned subspaces. The anchored geometry enables two practical interventions: reversible behavioral steering that projects out a concept's latent component at inference, and permanent removal via targeted weight ablation of anchored dimensions. Experiments on structured autoencoders show selective attenuation of targeted concepts with negligible impact on orthogonal features, and complete elimination with reconstruction error approaching theoretical bounds. Sparse Concept Anchoring therefore provides a practical pathway to interpretable, steerable behavior in learned representations.
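To make the two interventions concrete, here is a minimal PyTorch sketch. The helpers `steer` and `ablate`, the unit direction `v`, and the index list `dims` are illustrative assumptions (the abstract specifies only projection at inference and weight ablation of anchored dimensions), not the paper's code.

```python
import torch

def steer(z, v, alpha=1.0):
    """Reversible steering at inference: remove (alpha=1) or attenuate
    (0 < alpha < 1) the component of latent z along a concept's anchored
    unit direction v. Leaves the orthogonal complement untouched."""
    v = v / v.norm()
    return z - alpha * (z @ v).unsqueeze(-1) * v

def ablate(decoder_weight, dims):
    """Permanent removal: zero the decoder columns that read from the
    anchored latent dimensions, so the concept can no longer be decoded."""
    with torch.no_grad():
        decoder_weight[:, dims] = 0.0
```

Steering is reversible because the projected-out component can simply be added back; ablation is permanent because the decoder weights tied to the anchored dimensions are destroyed.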