Reward Inside the Model: A Lightweight Hidden-State Reward Model for LLM's Best-of-N sampling

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large text-level reward models carry high parameter counts and computational overhead, hindering practical deployment in LLM inference. This paper proposes ELHSR, a hyper-lightweight reward model that builds reward signals directly from LLM hidden states, requiring no backbone fine-tuning and accepting either hidden states or logits as input. ELHSR consists of only a linear projection layer plus hidden-state aggregation, using under 0.005% of the parameters of baseline reward models, and its logits-only mode makes it applicable to some closed-source LLMs. Evaluated across multiple reasoning tasks, ELHSR significantly improves Best-of-N sampling quality while scoring each sample orders of magnitude faster and with far fewer FLOPs than baseline reward models, and it trains effectively from only a small number of examples. ELHSR thus offers exceptional efficiency, broad compatibility, and seamless integration into existing LLM pipelines.
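The summary above describes ELHSR as just a linear projection over per-token hidden states followed by an aggregation step that produces a scalar reward per sampled completion. A minimal sketch of that idea, assuming a gated (softmax-weighted) aggregation and a two-output head layout — these details, and all names here, are illustrative assumptions, not the paper's exact specification:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden size (toy value; real LLM hidden sizes are 4096+)

# ELHSR's only trainable parameters: a single linear projection.
# Here it maps each token's hidden state to (reward, gate); the exact
# head layout is an assumption for illustration.
W = rng.normal(scale=0.02, size=(d, 2))
b = np.zeros(2)

def elhsr_score(hidden_states: np.ndarray) -> float:
    """Score one sampled completion from its per-token hidden states (T, d)."""
    out = hidden_states @ W + b              # (T, 2)
    rewards, gates = out[:, 0], out[:, 1]
    weights = np.exp(gates - gates.max())
    weights /= weights.sum()                 # softmax over tokens
    return float(weights @ rewards)          # aggregate to one scalar reward

# Hidden states for a 10-token sampled completion (random stand-in).
h = rng.normal(size=(10, d))
score = elhsr_score(h)
```

Because the head is a single `(d, 2)` matrix, its parameter count is negligible next to the backbone, which is what makes the sub-0.005% parameter claim plausible; training would fit `W` and `b` on labeled samples while the LLM stays frozen.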

📝 Abstract
High-quality reward models are crucial for unlocking the reasoning potential of large language models (LLMs), with best-of-N voting demonstrating significant performance gains. However, current reward models, which typically operate on the textual output of LLMs, are computationally expensive and parameter-heavy, limiting their real-world applications. We introduce the Efficient Linear Hidden State Reward (ELHSR) model, a novel, highly parameter-efficient approach that leverages the rich information embedded in LLM hidden states to address these issues. ELHSR systematically outperforms baselines with less than 0.005% of their parameters, requiring only a few samples for training. ELHSR also achieves orders-of-magnitude efficiency improvements, with significantly less time and fewer FLOPs per sample than baseline reward models. Moreover, ELHSR exhibits robust performance even when trained only on logits, extending its applicability to some closed-source LLMs. In addition, ELHSR can be combined with traditional reward models to achieve additional performance gains.
Problem

Research questions and friction points this paper is trying to address.

High-quality reward models are crucial but computationally expensive.
Current text-level reward models are too parameter-heavy for many real-world applications.
A reward model that retains performance at a small fraction of the parameters is needed.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLM hidden states for efficiency.
Uses minimal parameters for high performance.
Combines with traditional reward models for additional gains.
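The innovations above all serve one inference-time use: Best-of-N sampling, where the LLM generates N candidate completions and the reward model picks the highest-scoring one. A minimal, self-contained sketch of that selection step; the `score_fn` stand-in (string length here, purely for demonstration) would in ELHSR be the linear head applied to each candidate's hidden states:

```python
def best_of_n(samples, score_fn):
    """Best-of-N selection: return the highest-reward sample and its score."""
    scores = [score_fn(s) for s in samples]
    best_idx = max(range(len(samples)), key=scores.__getitem__)
    return samples[best_idx], scores[best_idx]

# Stand-in scorer for illustration: longer answer = higher reward.
candidates = ["answer A", "answer BB", "answer CCC"]
best, best_score = best_of_n(candidates, score_fn=len)
# best == "answer CCC", best_score == 10
```

Since scoring is one matrix multiply per candidate, the selection cost is dominated by generation, not by the reward model — which is the efficiency argument the card makes.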