Interpretable Reward Model via Sparse Autoencoder

📅 2025-08-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional reward models (RMs) in RLHF suffer from poor interpretability and limited adaptability to human preferences; existing multi-dimensional RMs still struggle with feature-level attribution and rely heavily on costly manual annotations. To address this, we propose the Sparse Autoencoder-Enhanced Reward Model (SARM), the first RM architecture integrating a pre-trained sparse autoencoder to map LLM hidden-layer activations into an interpretable, sparse, and unambiguous feature space. This enables fine-grained reward attribution and precise alignment with human preferences. Crucially, SARM dynamically disentangles reward components without additional labeling, supporting real-time feature attribution and preference transfer. Experiments across multiple benchmarks demonstrate that SARM significantly outperforms state-of-the-art baselines while achieving superior transparency, flexibility, and generalization. The implementation is publicly available.

📝 Abstract
Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of an LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations to produce transparent and conceptually meaningful reward scores. Empirical evaluations demonstrate that SARM facilitates direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models. Our code is available at https://github.com/schrieffer-z/sarm.
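The architecture described above can be sketched in a few lines: an SAE encoder maps LLM hidden activations into a sparse, non-negative feature space, and a scalar head linearly aggregates those activations into a reward. This is a minimal illustrative sketch, not the authors' implementation; all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class SARMHead:
    """Toy sketch of the SARM idea: SAE encoder + linear scalar head.

    Illustrative only; the real model uses a pretrained SAE over an
    LLM-based RM's hidden states.
    """
    def __init__(self, hidden_dim, feature_dim):
        # SAE encoder: hidden activations -> (over-complete) feature space
        self.W_enc = rng.standard_normal((hidden_dim, feature_dim)) * 0.1
        # Scalar head: one weight per interpretable feature
        self.w_reward = rng.standard_normal(feature_dim) * 0.1

    def __call__(self, h):
        # ReLU keeps only positively activated features, giving sparsity
        features = np.maximum(h @ self.W_enc, 0.0)
        # Reward is a transparent linear combination of feature activations
        reward = features @ self.w_reward
        return reward, features

head = SARMHead(hidden_dim=8, feature_dim=32)
reward, feats = head(rng.standard_normal((2, 8)))
print(reward.shape, feats.shape)  # (2,) (2, 32)
```

Because the final score is linear in the feature activations, each feature's contribution to the reward can be read off directly, which is what makes the reward assignment transparent.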
Problem

Research questions and friction points this paper is trying to address.

Traditional reward models lack interpretability and flexibility
Multidimensional RMs fail to provide feature-level attribution
Need for dynamic adjustment to user preference shifts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoder-enhanced Reward Model (SARM)
Maps hidden activations to interpretable feature space
Dynamic adjustment to preference shifts
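The last two bullets follow from the linear scalar head: each feature's reward contribution is simply its weight times its activation, and a preference shift can be expressed by rescaling the weights of specific features without retraining. A hypothetical sketch (feature index 5 and all values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
feats = np.maximum(rng.standard_normal(16), 0.0)  # sparse feature activations
w = rng.standard_normal(16) * 0.1                 # scalar-head weights

# Feature-level attribution: contribution of feature i is w[i] * feats[i]
contrib = w * feats
top = np.argsort(-np.abs(contrib))[:3]
print("top contributing features:", top)

# Hypothetical preference shift: zero out one feature's influence
w_adjusted = w.copy()
w_adjusted[5] = 0.0
print("reward before/after:", feats @ w, feats @ w_adjusted)
```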
Shuyi Zhang
East China Normal University
Big data analysis · Semi-supervised learning · High-dimensional statistics · Applied data science

Wei Shi
University of Science and Technology of China

Sihang Li
University of Science and Technology of China

Jiayi Liao
University of Science and Technology of China
Recommendation · Large Language Model

Tao Liang
Douyin Co., Ltd.

Hengxing Cai
Sun Yat-sen University
LLM · VLM · VLN · UAV

Xiang Wang
University of Science and Technology of China