Measuring and Guiding Monosemanticity

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing sparse autoencoders (SAEs) in large language models (LLMs) suffer from unreliable feature isolation and poor semantic monosemanticity, limiting model interpretability and controllability. To address this, we propose the Feature Monosemanticity Score (FMS), a quantitative metric for assessing the semantic purity of latent features, and introduce Guided Sparse Autoencoders (G-SAE), a framework that conditions latent representations on labeled concept annotations during training to explicitly optimize semantic disentanglement in the latent space. Compared to baseline SAEs, G-SAE significantly improves both monosemanticity and concept localization accuracy. Empirically, on toxicity detection, writing style identification, and privacy attribute recognition tasks, G-SAE enables precise, interpretable, and controllable semantic interventions without degrading generation quality. These results demonstrate the framework's effectiveness and generalizability for controllable representation learning in LLMs.
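
This summary does not spell out how FMS is actually computed, so the sketch below is only an illustrative proxy for the idea: it scores a single SAE latent by how cleanly its activations separate inputs where a labeled concept is present from those where it is absent, using AUROC as the purity measure. The function name `monosemanticity_proxy` and its interface are assumptions, not the paper's definition.

```python
# Illustrative proxy only -- not the paper's FMS formula, which this summary
# does not give. Scores one SAE latent by how well its activation separates
# concept-positive from concept-negative inputs.
import numpy as np
from sklearn.metrics import roc_auc_score

def monosemanticity_proxy(feature_acts: np.ndarray,
                          concept_labels: np.ndarray) -> float:
    """Return a score in [0, 1] for one latent feature.

    feature_acts   -- activations of one SAE latent over n inputs, shape (n,)
    concept_labels -- binary concept annotations for the same inputs, shape (n,)
    0.5 means the feature is unrelated to the concept; 1.0 means it fires
    if and only if the concept is present (a perfectly monosemantic feature
    under this proxy).
    """
    return roc_auc_score(concept_labels, feature_acts)
```

Applied over all latents, a metric of this shape turns "incomplete feature isolation" from an anecdotal complaint into something measurable: latents scoring near 0.5 are polysemantic or unrelated to the target concept.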

📝 Abstract
There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce the Feature Monosemanticity Score (FMS), a novel metric to quantify feature monosemanticity in latent representations. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, behavior detection, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.
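
As one concrete reading of "conditions latent representations on labeled concepts during training", the sketch below reserves the first few latents of an otherwise standard SAE and adds a supervised term that pushes them to predict the concept annotations, alongside the usual reconstruction and sparsity losses. This is a minimal sketch under stated assumptions: the class name `GuidedSAE`, the loss weights, and the choice of binary cross-entropy on encoder pre-activations are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedSAE(nn.Module):
    """Sparse autoencoder whose first `n_concepts` latents are supervised."""
    def __init__(self, d_model: int, d_latent: int, n_concepts: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)
        self.n_concepts = n_concepts  # leading latents tied to labeled concepts

    def forward(self, x):
        pre = self.encoder(x)   # pre-activations (reused as concept logits)
        z = F.relu(pre)         # sparse, non-negative latent code
        return self.decoder(z), z, pre

def gsae_loss(model, x, concept_labels,
              lambda_sparse: float = 1e-3, lambda_concept: float = 1.0):
    """Reconstruction + L1 sparsity + supervised concept alignment."""
    x_hat, z, pre = model(x)
    recon = F.mse_loss(x_hat, x)   # reconstruct the LLM activations
    sparse = z.abs().mean()        # encourage sparse, isolated features
    # Conditioning: the reserved latents must predict the concept annotations,
    # so each labeled concept gets a dedicated, interpretable direction.
    concept = F.binary_cross_entropy_with_logits(
        pre[:, :model.n_concepts], concept_labels.float())
    return recon + lambda_sparse * sparse + lambda_concept * concept
```

With each labeled concept pinned to a known latent, steering could then amount to clamping or scaling that latent before decoding, which is consistent with the fine-grained, low-degradation interventions the abstract describes.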
Problem

Research questions and friction points this paper is trying to address.

Quantify feature monosemanticity in latent representations
Improve interpretability and control of large language models
Enhance feature isolation and disentanglement in sparse autoencoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Feature Monosemanticity Score (FMS)
Proposes Guided Sparse Autoencoders (G-SAE)
Enhances interpretability and control of LLMs