🤖 AI Summary
Reinforcement learning faces two key challenges in real-world applications such as robotics, industrial automation, and healthcare: the difficulty of designing reward functions and the risk of unsafe exploration. To address these, this paper proposes BRIDGE, a two-stage algorithm that first learns an initial policy offline from reward-free expert demonstrations via behavior cloning, then refines it online using human preference feedback. This is the first work to provide a rigorous theoretical analysis of offline-to-online preference-based RL. Its core innovation is an uncertainty-weighted fusion mechanism that combines the behavior-cloning and preference signals, yielding provable regret bounds that improve as the offline dataset grows. Empirical evaluation on MuJoCo discrete- and continuous-control benchmarks shows that BRIDGE significantly outperforms both pure behavior cloning and standalone online preference-based RL, achieving better sample efficiency while ensuring safe policy improvement.
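The summary does not spell out the fusion objective, but the idea of weighting each learning signal by its uncertainty can be sketched as follows. This is an illustrative assumption, not the paper's actual objective: it pairs a behavior-cloning negative log-likelihood with a Bradley-Terry preference loss (a standard choice in preference-based RL) and weights each term inversely to an assumed uncertainty estimate. All function names and the weighting rule are hypothetical.

```python
import numpy as np

def bc_loss(logits, expert_actions):
    """Behavior-cloning loss: negative log-likelihood of expert actions
    under a discrete policy given by per-state action logits."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    picked = probs[np.arange(len(expert_actions)), expert_actions]
    return -np.mean(np.log(picked + 1e-12))

def preference_loss(score_a, score_b, prefers_a):
    """Bradley-Terry preference loss: models P(a preferred over b)
    as sigmoid(score_a - score_b) and scores the human labels."""
    p_a = 1.0 / (1.0 + np.exp(-(score_a - score_b)))
    return -np.mean(np.where(prefers_a,
                             np.log(p_a + 1e-12),
                             np.log(1.0 - p_a + 1e-12)))

def fused_objective(logits, expert_actions,
                    score_a, score_b, prefers_a,
                    sigma_bc, sigma_pref):
    """Hypothetical uncertainty-weighted fusion: each signal is weighted
    by the inverse variance of its (assumed given) uncertainty estimate,
    so the more reliable signal dominates the combined objective."""
    w_bc, w_pref = 1.0 / sigma_bc**2, 1.0 / sigma_pref**2
    fused = (w_bc * bc_loss(logits, expert_actions)
             + w_pref * preference_loss(score_a, score_b, prefers_a))
    return fused / (w_bc + w_pref)
```

With equal uncertainties the two losses contribute equally; as offline data accumulates one would expect `sigma_bc` to shrink, shifting weight toward the demonstration signal, which loosely mirrors the claimed dependence of the regret bound on offline data size.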
📝 Abstract
Deploying reinforcement learning (RL) in robotics, industry, and healthcare is blocked by two obstacles: the difficulty of specifying accurate rewards and the risk of unsafe, data-hungry exploration. We address this by proposing a two-stage framework that first learns a safe initial policy from a reward-free dataset of expert demonstrations, then fine-tunes it online using preference-based human feedback. We provide the first principled analysis of this offline-to-online approach and introduce BRIDGE, a unified algorithm that integrates both signals via an uncertainty-weighted objective. We derive regret bounds that shrink with the number of offline demonstrations, explicitly connecting the quantity of offline data to online sample efficiency. We validate BRIDGE in discrete- and continuous-control MuJoCo environments, showing it achieves lower regret than both standalone behavior cloning and online preference-based RL. Our work establishes a theoretical foundation for designing more sample-efficient interactive agents.