The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current alignment methods for large language models (e.g., RLHF) rely on diffuse, opaque parameter updates, offering little transparency or interpretability. To address this, the authors propose FSRL (Feature Steering with Reinforcement Learning), a transparent alignment framework. FSRL uses sparse autoencoders (SAEs) to expose interpretable semantic and stylistic features, then trains a lightweight adapter with reinforcement learning to steer those features toward a preference optimization objective. The core idea is to cast alignment as controllable editing of sparse, interpretable features; mechanistic analysis of the trained adapter suggests that stylistic features play a dominant role in alignment, providing a diagnosable, attribution-aware analytical pathway. Experiments show that FSRL is competitive with current RLHF methods on preference optimization while substantially improving the interpretability and controllability of the alignment process.

📝 Abstract
Aligning large language models is critical for their usability and safety. However, the prevailing approach of Reinforcement Learning from Human Feedback (RLHF) induces diffuse, opaque parameter changes, making it difficult to discern what the model has internalized. Hence, we introduce Feature Steering with Reinforcement Learning (FSRL), a transparent alignment framework that trains a lightweight adapter to steer behavior by modulating interpretable features from a Sparse Autoencoder (SAE). First, we demonstrate that FSRL is an effective method for preference optimization and is comparable with current RLHF methods. We then perform mechanistic analysis on the trained adapter, and find that its policy systematically promotes style features over explicit alignment concepts, suggesting that the preference optimization process rewards stylistic presentation as a proxy for quality. Ultimately, we hope that FSRL provides a tool for both interpretable model control and diagnosing the internal mechanisms of alignment.
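To make the mechanism concrete, here is a minimal NumPy sketch of the kind of feature-level steering the abstract describes: activations are encoded into sparse SAE features, an adapter-chosen per-feature shift is applied, and only the resulting edit is added back to the residual stream. All names, dimensions, and weights are hypothetical toy stand-ins, not the paper's actual architecture or trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_SAE = 16, 64  # toy dimensions (hypothetical)

# Toy SAE parameters: random stand-ins for trained weights
W_enc = rng.normal(scale=0.1, size=(D_MODEL, D_SAE))
b_enc = np.zeros(D_SAE)
W_dec = rng.normal(scale=0.1, size=(D_SAE, D_MODEL))
b_dec = np.zeros(D_MODEL)

def sae_encode(h):
    # Sparse feature activations via a ReLU encoder
    return np.maximum(0.0, (h - b_dec) @ W_enc + b_enc)

def sae_decode(f):
    return f @ W_dec + b_dec

def steer(h, adapter_deltas):
    # FSRL-style intervention (sketch): shift selected interpretable
    # features by adapter-chosen amounts, decode, and add only the
    # resulting edit back into the residual stream.
    f = sae_encode(h)
    edit = sae_decode(f + adapter_deltas) - sae_decode(f)
    return h + edit

h = rng.normal(size=D_MODEL)          # a residual-stream activation
deltas = np.zeros(D_SAE)
deltas[3] = 2.0                       # promote feature 3 (e.g., a style feature)
h_steered = steer(h, deltas)

# The edit is exactly delta times the feature's decoder direction
print(np.allclose(h_steered - h, 2.0 * W_dec[3]))  # True
```

Because the decoder is linear, each adapter delta contributes its feature's decoder direction scaled by the delta, which is what makes the intervention attributable to individual, interpretable features.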
Problem

Research questions and friction points this paper is trying to address.

RLHF induces diffuse, opaque parameter changes, making it hard to discern what the model has internalized
Unclear what preference optimization actually rewards: substantive quality or stylistic presentation
Lack of transparent tools for model control and for diagnosing alignment mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trains a lightweight adapter for transparent, feature-level alignment
Steers interpretable sparse autoencoder (SAE) features with reinforcement learning
Matches RLHF-style preference optimization while revealing that the learned policy promotes style features over explicit alignment concepts