🤖 AI Summary
Preference-based reinforcement learning (PbRL) holds promise for acquiring complex robotic behaviors without hand-crafted reward functions, yet faces three key challenges: high human feedback cost, ambiguous preference queries, and difficult credit assignment. To address these, we propose PRIMT, a hierarchical neuro-symbolic fusion framework that jointly leverages large language models (LLMs) and vision-language models (VLMs) to generate multimodal, semantically grounded feedback. We introduce foresight trajectory pre-generation to enhance query clarity and integrate hindsight counterfactual trajectory augmentation with a causal auxiliary loss to improve credit assignment. Evaluated on two locomotion and six manipulation tasks, our method significantly outperforms baseline PbRL models and scripted policies. Results demonstrate substantial reductions in human annotation effort, mitigation of reward ambiguity, and improved cross-task generalization, validating the efficacy of our neuro-symbolic, multimodal, and causally informed approach to scalable PbRL.
📝 Abstract
Preference-based reinforcement learning (PbRL) has emerged as a promising paradigm for teaching robots complex behaviors without reward engineering. However, its effectiveness is often limited by two critical challenges: the reliance on extensive human input and the inherent difficulty of resolving query ambiguity and credit assignment during reward learning. In this paper, we introduce PRIMT, a PbRL framework designed to overcome these challenges by leveraging foundation models (FMs) for multimodal synthetic feedback and trajectory synthesis. Unlike prior approaches that rely on single-modality FM evaluations, PRIMT employs a hierarchical neuro-symbolic fusion strategy, integrating the complementary strengths of large language models and vision-language models in evaluating robot behaviors to produce more reliable and comprehensive feedback. PRIMT also incorporates foresight trajectory generation, which reduces early-stage query ambiguity by warm-starting the trajectory buffer with bootstrapped samples, and hindsight trajectory augmentation, which enables counterfactual reasoning with a causal auxiliary loss to improve credit assignment. We evaluate PRIMT on two locomotion and six manipulation tasks across various benchmarks, demonstrating superior performance over FM-based and scripted baselines.