🤖 AI Summary
Existing RLHF methods rely on sparse, sequence-level scalar rewards, leading to inaccurate token-level credit assignment and poor interpretability. To address this, we propose an explainable AI–inspired dense reward shaping framework. Our approach is the first to integrate attribution methods—such as SHAP and LIME—into reward shaping function design, formulating token-level credit assignment as a differentiable optimization problem. We further introduce a bilevel Bayesian optimization scheme for noise-robust parameter learning and provide theoretical guarantees that additive feature attribution preserves the optimal policy. Experiments demonstrate substantial improvements in token-level credit assignment fidelity, accelerated policy convergence, and superior performance over mainstream RLHF baselines across multiple downstream tasks.
📝 Abstract
Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign a scalar reward to each sequence, using the final token as a surrogate indicator for the quality of the entire sequence. This leads to sparse feedback and suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function that leverages explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn the parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian optimization with policy training to handle noise in the token-level reward estimates. Our experiments show that a better balance of token-level reward attribution yields performance improvements over baselines on downstream tasks and finds an optimal policy faster during training. Furthermore, we prove theoretically that explainability methods satisfying additive feature attribution preserve the same optimal policy as the original reward.
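The key property the theory relies on is additivity: attribution methods in the SHAP family satisfy an efficiency axiom, so the per-token attributions sum exactly to the sequence-level reward, which is why redistributing the reward over tokens leaves the optimal policy unchanged. The sketch below illustrates this on a toy example (not the paper's implementation): `reward_model` is a hypothetical stand-in for a learned sequence-level reward model, masked tokens are simply dropped (one possible masking choice), and exact Shapley values are computed by enumerating subsets, which is only feasible for very short sequences.

```python
from itertools import combinations
from math import factorial

def reward_model(tokens):
    # Toy sequence-level reward; a hypothetical stand-in for a learned RM.
    good = {"helpful": 2.0, "thanks": 1.0}
    return sum(good.get(t, -0.1) for t in tokens)

def shapley_token_rewards(tokens, value_fn):
    """Exact Shapley attribution of the sequence reward to individual tokens.

    Enumerates all token subsets, so this is exponential in sequence length
    and meant only to demonstrate the additivity (efficiency) property.
    """
    n = len(tokens)
    phi = [0.0] * n
    idx = list(range(n))
    for i in idx:
        others = [j for j in idx if j != i]
        for k in range(n):
            for subset in combinations(others, k):
                # Shapley weight for a coalition of size |subset|.
                w = factorial(len(subset)) * factorial(n - len(subset) - 1) / factorial(n)
                with_i = value_fn([tokens[j] for j in sorted(subset + (i,))])
                without_i = value_fn([tokens[j] for j in subset])
                phi[i] += w * (with_i - without_i)
    return phi

tokens = ["thanks", "for", "being", "helpful"]
phi = shapley_token_rewards(tokens, reward_model)
# Efficiency: the dense token rewards sum to the total sequence reward
# (relative to the empty sequence), so the shaped return is unchanged.
total = reward_model(tokens) - reward_model([])
assert abs(sum(phi) - total) < 1e-9
```

Because the shaped per-token rewards sum to the original return for every trajectory, the ranking over policies is unchanged, which is the intuition behind the optimal-policy preservation result.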