🤖 AI Summary
To address the lack of interpretability and the heavy reliance on dense human annotations in Reinforcement Learning from Human Feedback (RLHF), this work proposes the Concept Bottleneck Reward Model (CB-RM). CB-RM introduces human-interpretable concepts as intermediate semantic representations, enabling interpretable preference learning. The authors further design an active learning strategy based on expected information gain that dynamically selects the most discriminative concept instances for annotation, substantially reducing labeling cost. This is the first work to integrate a concept bottleneck mechanism into reward modeling and to jointly combine concept supervision with active sampling in this setting. Experiments on the UltraFeedback dataset show that CB-RM matches the preference prediction accuracy of fully supervised baselines (Δ < 0.5%) while using significantly fewer concept annotations (accelerating labeling by 2.3×) and delivering high interpretability, strong sample efficiency, and auditability.
📝 Abstract
We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on the UltraFeedback dataset, our method outperforms baselines in interpretability and sample efficiency, marking a step towards more transparent, auditable, and human-aligned reward models.
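The abstract does not spell out the acquisition function, but an Expected Information Gain criterion for selecting which concept label to query can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the probabilities, the binary-concept assumption, and the candidate names (`helpfulness`, `honesty`) are all hypothetical, and EIG is computed as the reduction in entropy of the predicted preference after observing one concept label.

```python
import math

def entropy(p):
    # Binary (Bernoulli) entropy in nats.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def expected_information_gain(p_concept, p_pref_if_pos, p_pref_if_neg):
    """EIG of querying one binary concept label for a candidate pair.

    p_concept     : model's probability the concept holds (hypothetical).
    p_pref_if_pos : predicted preference probability given a positive label.
    p_pref_if_neg : predicted preference probability given a negative label.
    """
    # Marginal preference probability before observing the concept label.
    p_pref = p_concept * p_pref_if_pos + (1 - p_concept) * p_pref_if_neg
    # EIG = prior entropy - expected posterior entropy of the preference.
    expected_posterior = (p_concept * entropy(p_pref_if_pos)
                          + (1 - p_concept) * entropy(p_pref_if_neg))
    return entropy(p_pref) - expected_posterior

# Greedily query the concept instance whose label is most informative:
# an uncertain concept that flips the predicted preference scores highest.
candidates = [
    ("helpfulness", expected_information_gain(0.5, 0.9, 0.2)),
    ("honesty",     expected_information_gain(0.9, 0.6, 0.5)),
]
best = max(candidates, key=lambda c: c[1])
```

Under this criterion, annotation budget is spent where a single concept label most changes the model's preference prediction, which is the intuition behind accelerating concept learning without hurting preference accuracy.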