🤖 AI Summary
Existing preference-based reinforcement learning suffers from unreliable reward modeling under noisy preference labels, e.g., labels elicited from human annotators or vision-language models. To address this, we propose a three-teacher collaborative teaching framework: three reward models are trained in parallel; dynamic knowledge distillation is performed using low-loss samples; and expert demonstrations guide the selection of high-quality preference pairs. Remarkably, only one to three expert demonstrations suffice to suppress up to 40% preference noise. Our approach introduces the first "mutual-teaching" mechanism for this setting, requiring no explicit noise modeling or label-cleaning modules. Evaluated across diverse robotic manipulation tasks, it achieves success rates of up to 90%, substantially outperforming state-of-the-art methods and demonstrating strong generalization and robustness against preference-label noise.
📝 Abstract
Preference feedback collected from human or VLM annotators is often noisy, posing a significant challenge for preference-based reinforcement learning, which relies on accurate preference labels. To address this challenge, we propose TREND, a novel framework that integrates few-shot expert demonstrations with a tri-teaching strategy for effective noise mitigation. Our method trains three reward models simultaneously; each model treats its small-loss preference pairs as useful knowledge and teaches them to its peer networks for parameter updates. Remarkably, our approach requires as few as one to three expert demonstrations to achieve high performance. We evaluate TREND on various robotic manipulation tasks, achieving success rates of up to 90% even with noise levels as high as 40%, highlighting its robustness to noisy preference feedback. Project page: https://shuaiyihuang.github.io/publications/TREND.