SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder

📅 2025-11-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing reward models (RMs) rely on large-scale preference datasets and full-parameter fine-tuning of large language models (LLMs), incurring high computational costs and offering poor interpretability—especially under limited labeled data and compute resources. To address this, we propose SparseRM: a lightweight, interpretable RM that leverages sparse autoencoders (SAEs) to disentangle sparse, human-interpretable preference-related features from frozen LLM representations, then maps them to scalar alignment scores via a lightweight linear reward head. SparseRM uses fewer than 1% of the trainable parameters of conventional RMs, yet outperforms most state-of-the-art RMs across three major preference modeling benchmarks. It supports plug-and-play integration into downstream alignment pipelines such as RLHF. Our key contribution is the first systematic application of SAE-driven sparse representation decomposition to reward modeling—achieving a principled balance among parameter efficiency, feature interpretability, and generalization performance.

📝 Abstract
Reward models (RMs) are a core component in the post-training of large language models (LLMs), serving as proxies for human preference evaluation and guiding model alignment. However, training reliable RMs under limited resources remains challenging due to the reliance on large-scale preference annotations and the high cost of fine-tuning LLMs. To address this, we propose SparseRM, which leverages a Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations, enabling the construction of a lightweight and interpretable reward model. SparseRM first employs the SAE to decompose LLM representations into interpretable directions that capture preference-relevant features. The representations are then projected onto these directions to compute alignment scores, which quantify the strength of each preference feature in the representations. A simple reward head aggregates these scores to predict preference scores. Experiments on three preference modeling tasks show that SparseRM achieves superior performance over most mainstream RMs while using less than 1% of the trainable parameters. Moreover, it integrates seamlessly into downstream alignment pipelines, highlighting its potential for efficient alignment.
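Since only the reward head is trained, the training loop the abstract implies can be sketched with a standard pairwise (Bradley–Terry) preference loss. This is a minimal toy sketch, not the paper's implementation: the feature dimension `k`, the synthetic chosen/rejected score vectors, and the plain gradient-descent loop are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 8  # number of preference-relevant SAE features (assumed for illustration)

# Trainable parameters: the linear reward head only; the LLM and SAE stay frozen.
w = np.zeros(k)

def head(scores):
    """Aggregate per-feature alignment scores into a scalar reward."""
    return scores @ w

# Toy preference pairs: alignment-score vectors for chosen vs. rejected
# responses (synthetic data; chosen responses get a slight positive shift).
chosen = rng.standard_normal((32, k)) + 0.5
rejected = rng.standard_normal((32, k))

lr = 0.1
for _ in range(200):
    # Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)
    margin = head(chosen) - head(rejected)
    p = 1.0 / (1.0 + np.exp(-margin))          # P(chosen preferred)
    # Gradient of the mean loss w.r.t. w: mean of (p - 1) * (chosen - rejected)
    grad = ((p - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

acc = (head(chosen) > head(rejected)).mean()   # training pairwise accuracy
```

With only `k` trainable weights, this head is a tiny fraction of the parameters a fully fine-tuned RM would update, which is the efficiency claim the abstract makes.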
Problem

Research questions and friction points this paper is trying to address.

Training reward models with limited resources and preference annotations
Reducing computational costs of fine-tuning large language models
Extracting interpretable preference features from model representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sparse autoencoder to extract preference features
Projects representations onto interpretable directions for scoring
Aggregates scores with lightweight reward head
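The three bullets above can be sketched end to end. This is a minimal illustration under assumptions: the hidden size `d`, dictionary size `m`, the randomly initialized SAE weights, and the hypothetical `pref_idx` selection of preference-relevant features stand in for a trained SAE and the paper's actual feature-selection step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: LLM hidden size d, SAE dictionary size m.
d, m = 16, 64

# Frozen SAE encoder parameters (random stand-ins for a trained SAE).
W_enc = rng.standard_normal((m, d)) / np.sqrt(d)
b_enc = np.zeros(m)

def sae_encode(h):
    """Decompose a representation into sparse feature activations: ReLU(W_enc h + b)."""
    return np.maximum(W_enc @ h + b_enc, 0.0)

# Hypothetical indices of preference-relevant directions among the m features.
pref_idx = np.array([3, 17, 42, 58])

# Lightweight trainable reward head over the selected alignment scores.
w_head = rng.standard_normal(len(pref_idx)) * 0.1
b_head = 0.0

def reward(h):
    z = sae_encode(h)        # step 1: sparse feature decomposition
    scores = z[pref_idx]     # step 2: alignment scores along preference directions
    return float(w_head @ scores + b_head)  # step 3: aggregate to a scalar reward
```

Only `w_head` and `b_head` would be trained here; the frozen LLM supplies `h` and the frozen SAE supplies the decomposition.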
Authors
Dengcan Liu — University of Science and Technology of China, Hefei, China
Jiahao Li — University of Science and Technology of China, Hefei, China
Zheren Fu — University of Science and Technology of China (Multi-modal Learning, Vision-Language Model, AI Security)
Yi Tu — Ant Group (Computer Vision, Document Understanding, Vision Language Model)
Jiajun Li — Huawei Technologies Ltd
Zhendong Mao — University of Science and Technology of China (CV, NLP)
Yongdong Zhang — University of Science and Technology of China, Hefei, China