🤖 AI Summary
Existing sparse autoencoder (SAE) debiasing methods rely on decoder weights for bias mitigation, imposing inherent limitations on representation fairness and interpretability.
Method: We propose S&P TopK, a novel encoder-only debiasing framework that achieves fair representation learning solely through encoder weights—by integrating feature selection, weight orthogonalization, and interpolation-based projection—without decoder involvement or reconstruction objectives. This shifts SAE debiasing from reconstruction-driven paradigms toward controllable, interpretable, encoder-centric representation optimization.
Contribution/Results: We establish the feasibility of using encoder weights as the sole carrier of fair representations and introduce the first end-to-end, decoder-free SAE debiasing framework. On multiple vision-language tasks, our method improves fairness metrics by up to 3.2× over baselines, achieves 1.8× higher test-time debiasing efficacy than state-of-the-art methods, and preserves downstream task performance.
📝 Abstract
Sparse Autoencoders (SAEs) have proven valuable due to their ability to provide interpretable and steerable representations. Current SAE-based debiasing methods manipulate these sparse activations, presuming that feature representations are housed within the decoder weights. We challenge this fundamental assumption and introduce an encoder-focused alternative for representation debiasing, contributing three key findings: (i) we highlight an unconventional SAE feature selection strategy, (ii) we propose a novel SAE debiasing methodology that orthogonalizes input embeddings against encoder weights, and (iii) we establish a performance-preserving mechanism during debiasing through encoder weight interpolation. Our Selection and Projection framework, termed S&P TopK, surpasses conventional SAE usage in fairness metrics by a factor of up to 3.2 and advances state-of-the-art test-time VLM debiasing results by a factor of up to 1.8 while maintaining downstream performance.
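The paper's exact procedure is not reproduced here, but the two core operations the abstract names—orthogonalizing an input embedding against selected encoder weight rows, and interpolating between original and debiased encoder weights—can be sketched as follows. This is a minimal NumPy illustration under assumed conventions (bias directions taken as rows of the encoder weight matrix; `alpha` as a hypothetical interpolation coefficient), not the authors' implementation.

```python
import numpy as np

def orthogonalize(x, bias_directions):
    """Project the embedding x onto the orthogonal complement of each
    selected bias direction (e.g., encoder weight rows flagged by a
    feature-selection step). Directions are unit-normalized first."""
    for d in bias_directions:
        d = d / np.linalg.norm(d)
        x = x - np.dot(x, d) * d  # remove the component along d
    return x

def interpolate_weights(w_orig, w_debiased, alpha):
    """Linearly interpolate encoder weights: alpha=0 keeps the original
    weights (full downstream performance), alpha=1 is fully debiased."""
    return (1.0 - alpha) * w_orig + alpha * w_debiased
```

With mutually orthogonal bias directions, a single pass removes every biased component; for correlated directions, an orthonormal basis (e.g., via QR decomposition) would be projected out instead.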