Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone

📅 2025-09-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing sparse autoencoder (SAE) debiasing methods rely on decoder weights for bias mitigation, imposing inherent limitations on representation fairness and interpretability. Method: We propose S&P TopK, a novel encoder-only debiasing framework that achieves fair representation learning solely through encoder weights—by integrating feature selection, weight orthogonalization, and interpolation-based projection—without decoder involvement or reconstruction objectives. This shifts from reconstruction-driven paradigms to controllable, interpretable, encoder-centric representation optimization. Contribution/Results: We establish, for the first time, the feasibility of using encoder weights as the sole carrier of fair representations and introduce the first end-to-end, decoder-free SAE debiasing framework. On multiple vision-language tasks, our method improves fairness metrics by up to 3.2× over baselines, achieves 1.8× higher test-time debiasing efficacy than state-of-the-art methods, and strictly preserves downstream task performance.

📝 Abstract
Sparse Autoencoders (SAEs) have proven valuable due to their ability to provide interpretable and steerable representations. Current SAE-based debiasing methods manipulate these sparse activations, presuming that feature representations are housed within decoder weights. We challenge this fundamental assumption and introduce an encoder-focused alternative for representation debiasing, contributing three key findings: (i) we highlight an unconventional SAE feature selection strategy, (ii) we propose a novel SAE debiasing methodology that orthogonalizes input embeddings against encoder weights, and (iii) we establish a performance-preserving mechanism during debiasing through encoder weight interpolation. Our Selection and Projection framework, termed S&P TopK, surpasses conventional SAE usage in fairness metrics by a factor of up to 3.2 and advances state-of-the-art test-time VLM debiasing results by a factor of up to 1.8 while maintaining downstream performance.
Problem

Research questions and friction points this paper is trying to address.

Challenges decoder-focused debiasing assumption in sparse autoencoders
Proposes encoder-based method for fairness through feature selection
Introduces weight interpolation to preserve performance during debiasing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encoder-focused debiasing methodology
Orthogonalizes input against encoder weights
Encoder weight interpolation preserves performance
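The three innovations above can be sketched as a single projection-plus-interpolation step: orthogonalize the input embedding against the selected bias-related encoder rows, then blend the result with the original embedding to preserve downstream performance. The sketch below is an illustrative reconstruction under assumptions, not the authors' implementation; the function name `sp_topk_debias`, the QR-based orthogonalization, and the `alpha` interpolation parameter are hypothetical choices for illustration.

```python
import numpy as np

def sp_topk_debias(x, W_enc, bias_idx, alpha=1.0):
    """Hedged sketch of encoder-only debiasing (assumed form, not the
    paper's exact algorithm).

    x        : (d,) input embedding
    W_enc    : (k, d) encoder weight matrix, one row per SAE feature
    bias_idx : indices of encoder features selected as bias-related
    alpha    : interpolation strength (alpha=1 -> full projection,
               alpha=0 -> original embedding unchanged)
    """
    # Orthonormal basis for the subspace spanned by the selected
    # bias-related encoder rows (via QR decomposition).
    B = W_enc[bias_idx]          # (m, d) selected bias feature rows
    Q, _ = np.linalg.qr(B.T)     # (d, m) orthonormal columns
    # Remove the component of x lying in the bias subspace.
    x_proj = x - Q @ (Q.T @ x)
    # Interpolate between the original and debiased embeddings to
    # trade off fairness against downstream performance.
    return (1 - alpha) * x + alpha * x_proj
```

With `alpha=1` the output is exactly orthogonal to every selected encoder row; intermediate values of `alpha` recover the performance-preserving interpolation described above.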
Antonio Bărbălau
Bitdefender, Romania
Cristian Daniel Păduraru
Bitdefender, Romania
Teodor Poncu
University Politehnica of Bucharest, Romania
Alexandru Tifrea
ETH Zurich, Switzerland
Elena Burceanu
Bitdefender
Unsup Video Understanding · Tracking · Segmentation · Machine Learning · Distribution Shift