Interpretable Reward Model via Sparse Autoencoder

📅 2025-08-12

🤖 AI Summary

Traditional reward models (RMs) in RLHF suffer from poor interpretability and limited adaptability to human preferences; existing multi-dimensional RMs still struggle with feature-level attribution and rely heavily on costly manual annotations. To address this, we propose the Sparse Autoencoder-Enhanced Reward Model (SARM), the first RM architecture integrating a pre-trained sparse autoencoder to map LLM hidden-layer activations into an interpretable, sparse, and unambiguous feature space. This enables fine-grained reward attribution and precise alignment with human preferences. Crucially, SARM dynamically disentangles reward components without additional labeling, supporting real-time feature attribution and preference transfer. Experiments across multiple benchmarks demonstrate that SARM significantly outperforms state-of-the-art baselines while achieving superior transparency, flexibility, and generalization. The implementation is publicly available.

📝 Abstract

Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of an LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations to produce transparent and conceptually meaningful reward scores. Empirical evaluations demonstrate that SARM facilitates direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models. Our code is available at https://github.com/schrieffer-z/sarm.
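The pipeline described in the abstract (hidden activations, then a sparse feature space, then a scalar head) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `sae_encode` and `reward_with_attribution`, the ReLU-plus-top-k sparsity rule, and all weights below are assumptions chosen for illustration only.

```python
# Toy sketch of an SAE-enhanced reward computation (illustrative assumptions only).

def sae_encode(hidden, enc_weights, enc_bias, k):
    """Map a hidden vector to sparse features: linear layer, ReLU, keep the k largest.

    The top-k rule is an assumed sparsity mechanism, not taken from the abstract.
    """
    pre = [sum(w * h for w, h in zip(row, hidden)) + b
           for row, b in zip(enc_weights, enc_bias)]
    acts = [max(0.0, p) for p in pre]
    # Indices of the k largest activations; every other feature is zeroed out.
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def reward_with_attribution(features, head_weights):
    """Scalar head: the reward is a weighted sum of feature activations.

    Each term w_i * z_i is that feature's contribution to the reward,
    which is what makes feature-level attribution direct.
    """
    contributions = [w * z for w, z in zip(head_weights, features)]
    return sum(contributions), contributions


# Hypothetical 3-dimensional hidden state and 4-feature SAE.
hidden = [1.0, -2.0, 0.5]
enc_weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [1.0, 1.0, 1.0],
]
enc_bias = [0.0, 0.0, 0.0, 0.0]
head_weights = [0.5, 1.0, -1.0, 2.0]

features = sae_encode(hidden, enc_weights, enc_bias, k=2)
reward, contributions = reward_with_attribution(features, head_weights)
print(features)       # sparse feature activations
print(contributions)  # per-feature share of the reward
print(reward)
```

Because the head is linear over sparse features, only the few active features contribute to a given score, so the list of contributions can be read directly as an explanation of the reward.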

Problem

Research questions and friction points this paper is trying to address.

Traditional reward models lack interpretability and flexibility

Multidimensional RMs fail to provide feature-level attribution

Need for dynamic adjustment to user preference shifts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoder-enhanced Reward Model (SARM)

Maps hidden activations to interpretable feature space

Dynamic adjustment to preference shifts
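One way to picture dynamic adjustment to preference shifts is to rescale the head weight attached to a single interpretable feature and recompute the reward, with no retraining. The sketch below rests on that assumption; the summary does not state how SARM performs the adjustment, and the helper names `adjust_head` and `score`, the feature index, and the weights are all hypothetical.

```python
# Toy sketch: shift preferences by editing one head weight (assumed mechanism).

def score(features, head_weights):
    """Reward as a weighted sum of sparse feature activations."""
    return sum(w * z for w, z in zip(head_weights, features))


def adjust_head(head_weights, feature_index, scale):
    """Return a copy of the head with one feature's weight rescaled."""
    adjusted = list(head_weights)
    adjusted[feature_index] *= scale
    return adjusted


# Hypothetical sparse features for one response; feature 2 stands for a
# trait the user now wants penalized more strongly.
features = [0.0, 2.0, 1.0]
head = [0.3, 1.0, -0.5]

base_reward = score(features, head)
shifted_head = adjust_head(head, feature_index=2, scale=3.0)
shifted_reward = score(features, shifted_head)
print(base_reward, shifted_reward)
```

The original head is left untouched, so the adjustment can be applied per user or per request and reverted at any time.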
