🤖 AI Summary
Existing sparse autoencoder (SAE) debiasing methods rely on decoder weights for bias mitigation, imposing inherent limitations on representation fairness and interpretability.
Method: We propose S&P TopK, a novel encoder-only debiasing framework that achieves fair representation learning solely through encoder weights—by integrating feature selection, weight orthogonalization, and interpolation-based projection—without decoder involvement or reconstruction objectives. This shifts SAE debiasing from reconstruction-driven paradigms toward controllable, interpretable, encoder-centric representation optimization.
Contribution/Results: We establish the feasibility of using encoder weights as the sole carrier of fair representations and introduce the first end-to-end, decoder-free SAE debiasing framework. On multiple vision-language tasks, our method improves fairness metrics by up to 3.2× over baselines, achieves 1.8× higher test-time debiasing efficacy than state-of-the-art methods, and preserves downstream task performance.
📝 Abstract
Sparse Autoencoders (SAEs) have proven valuable due to their ability to provide interpretable and steerable representations. Current SAE-based debiasing methods manipulate these sparse activations, presuming that feature representations are housed within the decoder weights. We challenge this fundamental assumption and introduce an encoder-focused alternative for representation debiasing, contributing three key findings: (i) we highlight an unconventional SAE feature selection strategy, (ii) we propose a novel SAE debiasing methodology that orthogonalizes input embeddings against encoder weights, and (iii) we establish a performance-preserving mechanism during debiasing through encoder weight interpolation. Our Selection and Projection framework, termed S&P TopK, surpasses conventional SAE usage in fairness metrics by a factor of up to 3.2 and advances state-of-the-art test-time VLM debiasing results by a factor of up to 1.8 while maintaining downstream performance.
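The paper's exact procedure is not reproduced here, but the two core operations the abstract names—orthogonalizing an input embedding against selected encoder weight rows, and interpolating between original and debiased encoder weights—can be sketched as follows. This is a minimal NumPy illustration under assumed conventions (bias directions taken as rows of the encoder weight matrix; `alpha` as a hypothetical interpolation coefficient), not the authors' implementation.

```python
import numpy as np

def orthogonalize(x, bias_directions):
    """Project the embedding x onto the orthogonal complement of each
    selected bias direction (e.g., encoder weight rows flagged by a
    feature-selection step). Directions are unit-normalized first."""
    for d in bias_directions:
        d = d / np.linalg.norm(d)
        x = x - np.dot(x, d) * d  # remove the component along d
    return x

def interpolate_weights(w_orig, w_debiased, alpha):
    """Linearly interpolate encoder weights: alpha=0 keeps the original
    weights (full downstream performance), alpha=1 is fully debiased."""
    return (1.0 - alpha) * w_orig + alpha * w_debiased
```

With mutually orthogonal bias directions, a single pass removes every biased component; for correlated directions, an orthonormal basis (e.g., via QR decomposition) would be projected out instead.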