🤖 AI Summary
Preference-based reinforcement learning (PbRL) holds promise for acquiring complex robotic behaviors without hand-crafted reward functions, yet faces three key challenges: high human feedback cost, ambiguous preference queries, and difficult credit assignment. To address these, we propose PRIMT, a hierarchical neuro-symbolic fusion framework that jointly leverages large language models (LLMs) and vision-language models (VLMs) to generate multimodal, semantically grounded feedback. We introduce foresight trajectory pre-generation to enhance query clarity and integrate hindsight counterfactual trajectory augmentation with a causal auxiliary loss to improve credit assignment. Evaluated on two locomotion and six manipulation tasks, our method significantly outperforms baseline PbRL models and scripted policies. Results demonstrate substantial reductions in human annotation effort, mitigation of reward ambiguity, and improved cross-task generalization, validating the efficacy of our neuro-symbolic, multimodal, and causally informed approach to scalable PbRL.
📝 Abstract
Preference-based reinforcement learning (PbRL) has emerged as a promising paradigm for teaching robots complex behaviors without reward engineering. However, its effectiveness is often limited by two critical challenges: the reliance on extensive human input and the inherent difficulty of resolving query ambiguity and credit assignment during reward learning. In this paper, we introduce PRIMT, a PbRL framework designed to overcome these challenges by leveraging foundation models (FMs) for multimodal synthetic feedback and trajectory synthesis. Unlike prior approaches that rely on single-modality FM evaluations, PRIMT employs a hierarchical neuro-symbolic fusion strategy, integrating the complementary strengths of large language models and vision-language models in evaluating robot behaviors to produce more reliable and comprehensive feedback. PRIMT also incorporates foresight trajectory generation, which reduces early-stage query ambiguity by warm-starting the trajectory buffer with bootstrapped samples, and hindsight trajectory augmentation, which enables counterfactual reasoning with a causal auxiliary loss to improve credit assignment. We evaluate PRIMT on two locomotion and six manipulation tasks across various benchmarks, demonstrating superior performance over FM-based and scripted baselines.