🤖 AI Summary
Existing preference-based reinforcement learning suffers from unreliable reward modeling under noisy preference labels, e.g., labels elicited from human annotators or vision-language models. To address this, we propose a three-teacher collaborative teaching framework: three reward models are trained in parallel; dynamic knowledge distillation is performed using low-loss samples; and expert demonstrations guide the selection of high-quality preference pairs. Remarkably, only one to three expert demonstrations suffice to suppress up to 40% preference noise. Our approach introduces the first "mutual-teaching" mechanism for this setting, requiring no explicit noise modeling or label-cleaning modules. Evaluated across diverse robotic manipulation tasks, it achieves success rates of up to 90%, substantially outperforming state-of-the-art methods and demonstrating strong generalization and robustness against preference-label noise.
📝 Abstract
Preference feedback collected from human or VLM annotators is often noisy, posing a significant challenge for preference-based reinforcement learning, which relies on accurate preference labels. To address this challenge, we propose TREND, a novel framework that integrates few-shot expert demonstrations with a tri-teaching strategy for effective noise mitigation. Our method trains three reward models simultaneously; each model treats its small-loss preference pairs as useful knowledge and teaches them to its peer networks for parameter updates. Remarkably, our approach requires as few as one to three expert demonstrations to achieve high performance. We evaluate TREND on various robotic manipulation tasks, achieving success rates of up to 90% even with noise levels as high as 40%, highlighting its robustness to noisy preference feedback. Project page: https://shuaiyihuang.github.io/publications/TREND.