🤖 AI Summary
To address the lack of interpretability and the heavy reliance on dense human annotations in Reinforcement Learning from Human Feedback (RLHF), this work proposes the Concept Bottleneck Reward Model (CB-RM). CB-RM introduces human-interpretable concepts as intermediate semantic representations, enabling interpretable preference learning. The authors further design an active learning strategy based on expected information gain that dynamically selects the most discriminative concept instances for annotation, substantially reducing labeling cost. This is the first work to integrate a concept bottleneck mechanism into reward modeling and to jointly combine concept supervision with active sampling in this setting. Experiments on the UltraFeedback dataset show that CB-RM matches the preference prediction accuracy of fully supervised baselines (Δ < 0.5%) while using significantly fewer concept annotations (accelerating labeling by 2.3×) and delivering high interpretability, strong sample efficiency, and auditability.
📝 Abstract
We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on the UltraFeedback dataset, our method outperforms baselines in interpretability and sample efficiency, marking a step towards more transparent, auditable, and human-aligned reward models.
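The abstract does not spell out the acquisition function, but an Expected Information Gain criterion for selecting which concept label to query can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the probabilities, the binary-concept assumption, and the candidate names (`helpfulness`, `honesty`) are all hypothetical, and EIG is computed as the reduction in entropy of the predicted preference after observing one concept label.

```python
import math

def entropy(p):
    # Binary (Bernoulli) entropy in nats.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def expected_information_gain(p_concept, p_pref_if_pos, p_pref_if_neg):
    """EIG of querying one binary concept label for a candidate pair.

    p_concept     : model's probability the concept holds (hypothetical).
    p_pref_if_pos : predicted preference probability given a positive label.
    p_pref_if_neg : predicted preference probability given a negative label.
    """
    # Marginal preference probability before observing the concept label.
    p_pref = p_concept * p_pref_if_pos + (1 - p_concept) * p_pref_if_neg
    # EIG = prior entropy - expected posterior entropy of the preference.
    expected_posterior = (p_concept * entropy(p_pref_if_pos)
                          + (1 - p_concept) * entropy(p_pref_if_neg))
    return entropy(p_pref) - expected_posterior

# Greedily query the concept instance whose label is most informative:
# an uncertain concept that flips the predicted preference scores highest.
candidates = [
    ("helpfulness", expected_information_gain(0.5, 0.9, 0.2)),
    ("honesty",     expected_information_gain(0.9, 0.6, 0.5)),
]
best = max(candidates, key=lambda c: c[1])
```

Under this criterion, annotation budget is spent where a single concept label most changes the model's preference prediction, which is the intuition behind accelerating concept learning without hurting preference accuracy.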