🤖 AI Summary
Unsupervised disentanglement methods such as sparse autoencoders struggle to support identifiable, concept-level manipulation in LLM embedding spaces: their latent dimensions can entangle multiple concepts, causing cross-concept interference. To address this, we propose the Sparse Shift Autoencoder (SSAE), the first method to model embedding *differences* rather than raw embeddings, enabling precise, identifiable, single-concept interventions from only pairwise observations of multi-concept shifts. SSAE thereby overcomes a key theoretical limitation of conventional sparse autoencoders: they cannot guarantee concept identifiability in a purely unsupervised setting, without contrastive labels or model fine-tuning. We validate SSAE on semi-synthetic and real-world language data, demonstrating high-precision single-concept controllable generation. Our approach improves both interpretability and alignment efficiency in LLM representation manipulation.
📝 Abstract
Steering methods manipulate the representations of large language models (LLMs) to induce responses that have desired properties, e.g., truthfulness, offering a promising approach for LLM alignment without the need for fine-tuning. Traditionally, steering has relied on supervision, such as from contrastive pairs of prompts that vary in a single target concept, which is costly to obtain and limits the speed of steering research. An appealing alternative is to use unsupervised approaches such as sparse autoencoders (SAEs) to map LLM embeddings to sparse representations that capture human-interpretable concepts. However, without further assumptions, SAEs may not be identifiable: they could learn latent dimensions that entangle multiple concepts, leading to unintentional steering of unrelated properties. We introduce Sparse Shift Autoencoders (SSAEs) that instead map the differences between embeddings to sparse representations. Crucially, we show that SSAEs are identifiable from paired observations that vary in *multiple unknown concepts*, leading to accurate steering of single concepts without the need for supervision. We empirically demonstrate accurate steering across semi-synthetic and real-world language datasets using Llama-3.1 embeddings.
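The core idea of the abstract can be sketched in a few lines: instead of encoding a raw embedding, an SSAE encodes the *shift* between two paired embeddings into a sparse code. Below is a minimal NumPy sketch under stated assumptions, not the authors' implementation: dimensions are toy-sized, the weights `W_enc`/`W_dec` are random stand-ins for parameters that would be learned by minimizing reconstruction error plus an L1 sparsity penalty, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed, d_latent = 16, 8  # hypothetical toy sizes

# Hypothetical (untrained) encoder/decoder weights; in practice these would
# be learned from many paired embedding shifts.
W_enc = rng.normal(size=(d_latent, d_embed)) / np.sqrt(d_embed)
W_dec = rng.normal(size=(d_embed, d_latent)) / np.sqrt(d_latent)

def encode(delta):
    # ReLU encoder applied to the embedding *difference*, not a raw embedding.
    return np.maximum(0.0, W_enc @ delta)

def decode(z):
    return W_dec @ z

def ssae_loss(delta, lam=0.1):
    # Reconstruction error on the shift plus an L1 penalty that pushes the
    # latent code toward sparsity (few active concept dimensions per shift).
    z = encode(delta)
    return np.sum((delta - decode(z)) ** 2) + lam * np.sum(np.abs(z))

# A paired observation: two prompts whose embeddings differ in some
# (unknown) concepts; the SSAE sees only their difference.
e_base = rng.normal(size=d_embed)
e_edit = rng.normal(size=d_embed)
delta = e_edit - e_base

z = encode(delta)  # sparse code over concept-shift dimensions
```

In this picture, steering a single concept amounts to scaling one coordinate of `z` and adding the decoded shift back to an embedding; the paper's identifiability result is what justifies reading individual latent coordinates as individual concepts.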