PrefMMT: Modeling Human Preferences in Preference-based Reinforcement Learning with Multimodal Transformers

📅 2024-09-20
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses inaccurate human preference modeling in preference-based reinforcement learning (PbRL), challenging the restrictive Markov assumption common in prior work. The authors propose the first multimodal Transformer architecture designed specifically for preference modeling. The method decouples state and action sequences as distinct modalities and employs a hierarchical design that jointly captures intra-modal temporal dependencies and inter-modal state–action interactions, yielding a non-Markovian, multimodal joint preference representation. Key innovations include state–action decoupled encoding, hierarchical temporal modeling, and a preference sequence learning mechanism. Evaluated on D4RL locomotion and Meta-World manipulation benchmarks, the approach significantly outperforms existing preference modeling methods, improving accuracy and robustness in aligning learned policies with human preferences.

📝 Abstract
Preference-based reinforcement learning (PbRL) shows promise in aligning robot behaviors with human preferences, but its success depends heavily on the accurate modeling of human preferences through reward models. Most methods adopt Markovian assumptions for preference modeling (PM), which overlook the temporal dependencies within robot behavior trajectories that impact human evaluations. While recent works have utilized sequence modeling to mitigate this by learning sequential non-Markovian rewards, they ignore the multimodal nature of robot trajectories, which consist of elements from two distinctive modalities: state and action. As a result, they often struggle to capture the complex interplay between these modalities that significantly shapes human preferences. In this paper, we propose a multimodal sequence modeling approach for PM by disentangling state and action modalities. We introduce a multimodal transformer network, named PrefMMT, which hierarchically leverages intra-modal temporal dependencies and inter-modal state-action interactions to capture complex preference patterns. We demonstrate that PrefMMT consistently outperforms state-of-the-art PM baselines on locomotion tasks from the D4RL benchmark and manipulation tasks from the Meta-World benchmark.
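The reward-model learning the abstract refers to is typically fit to pairwise human labels with a Bradley–Terry objective over summed segment rewards; in the non-Markovian setting described here, what changes is how the per-step rewards are produced (by a sequence model rather than a per-state network), not the objective itself. A minimal sketch in plain Python, with illustrative names not taken from the paper:

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def preference_nll(rewards_a, rewards_b, prefer_a):
    """Bradley-Terry negative log-likelihood for one labelled segment pair.

    rewards_a, rewards_b: per-step rewards predicted for two trajectory
    segments. In a non-Markovian model such as PrefMMT, each step's reward
    may depend on the whole state-action history, but the loss is unchanged.
    prefer_a: 1.0 if the human preferred segment A, else 0.0.
    """
    # P(A preferred) = sigmoid(sum(r_a) - sum(r_b))
    margin = sum(rewards_a) - sum(rewards_b)
    return -(prefer_a * log_sigmoid(margin)
             + (1.0 - prefer_a) * log_sigmoid(-margin))
```

With equal segment returns the model assigns probability 0.5 to either label, so the loss is log 2; a larger return margin in favour of the labelled segment drives the loss toward zero.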
Problem

Research questions and friction points this paper is trying to address.

Accurately modeling human preferences over robot behavior trajectories.
Capturing both temporal dependencies and state–action (multimodal) interactions within trajectories.
Improving preference-based reinforcement learning through multimodal transformer-based preference modeling.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal transformer for preference modeling
Disentangles state and action modalities
Hierarchical intra-modal and inter-modal interactions
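Read together, these bullets describe a two-stage attention stack: embed states and actions separately, apply self-attention within each modality, then cross-attention between them before a per-step reward head. The NumPy toy below is a data-flow sketch only, with random weights and illustrative names; it is not the authors' implementation (which would use trained multi-head attention blocks, and likely causal masking for the temporal self-attention).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention (no masking, for brevity)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

class PrefMMTSketch:
    """Illustrative sketch of the hierarchical multimodal idea:
    intra-modal self-attention per modality, then inter-modal
    state-action cross-attention, then a per-step reward head."""

    def __init__(self, state_dim, action_dim, d=16, seed=0):
        rng = np.random.default_rng(seed)
        self.Ws = rng.normal(size=(state_dim, d)) / np.sqrt(state_dim)
        self.Wa = rng.normal(size=(action_dim, d)) / np.sqrt(action_dim)
        self.w_out = rng.normal(size=(2 * d,)) / np.sqrt(2 * d)

    def __call__(self, states, actions):
        s = states @ self.Ws       # embed state tokens, shape (T, d)
        a = actions @ self.Wa      # embed action tokens, shape (T, d)
        s = attention(s, s, s)     # intra-modal temporal dependencies
        a = attention(a, a, a)
        s2 = attention(s, a, a)    # inter-modal: states attend to actions
        a2 = attention(a, s, s)    # inter-modal: actions attend to states
        h = np.concatenate([s2, a2], axis=-1)
        return h @ self.w_out      # non-Markovian per-step rewards, shape (T,)
```

Because every per-step output is computed through attention over the full sequence, each reward depends on the whole trajectory segment rather than on a single state-action pair, which is the non-Markovian property the summary emphasizes.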
👥 Authors

Dezhong Zhao
College of Mechanical and Electrical Engineering, Beijing University of Chemical Technology, Beijing, China
Ruiqi Wang
SMART Laboratory, Department of Computer and Information Technology, Purdue University, West Lafayette, IN, USA
Dayoon Suh
Purdue University
Taehyeon Kim
SMART Laboratory, Department of Computer and Information Technology, Purdue University, West Lafayette, IN, USA
Ziqin Yuan
Purdue University
Byung-Cheol Min
Professor of Computer Science and Intelligent Systems Engineering, Indiana University Bloomington
Guohua Chen
College of Mechanical and Electrical Engineering, Beijing University of Chemical Technology, Beijing, China