🤖 AI Summary
Designing reward functions for complex sequential tasks is challenging because domain knowledge is difficult to encode by hand. Method: The paper proposes *reward-rational partial orderings over observations*, a domain-agnostic reward-learning framing that unifies heterogeneous human feedback, including pairwise preferences and ranked demonstrations (among them negative examples), without requiring domain-specific priors. Building on this framing, the authors introduce LEOPARD, an algorithm that jointly optimizes a reward function against preference feedback and partial-order constraints over demonstrated trajectories. Contribution/Results: Experiments across robotics control, navigation, and text generation show that LEOPARD outperforms existing baselines by a significant margin when only a limited amount of mixed feedback is available, and that it remains robust, generalizes across diverse tasks, and is compatible with heterogeneous data sources.
📝 Abstract
Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be used instead. However, existing approaches utilising both together are often ad hoc, rely on domain-specific properties, or fail to scale. We develop a new framing for learning from human data, *reward-rational partial orderings over observations*, designed to be flexible and scalable. Based on this we introduce a practical algorithm, LEOPARD: Learning Estimated Objectives from Preferences And Ranked Demonstrations. LEOPARD can learn from a broad range of data, including negative demonstrations, to efficiently learn reward functions across a wide range of domains. We find that when a limited amount of preference and demonstration feedback is available, LEOPARD outperforms existing baselines by a significant margin. Furthermore, we use LEOPARD to investigate learning from many types of feedback compared to just a single one, and find that combining feedback types is often beneficial.
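To make the "partial orderings over observations" idea concrete, here is a minimal sketch, not the paper's actual implementation: a linear reward over observation features is fitted with a Bradley-Terry loss over adjacent pairs in each ranked chain of trajectories. The function names (`fit_reward`, `traj_features`) and the linear-reward assumption are illustrative choices, not from the paper; LEOPARD's actual objective, architecture, and data format may differ.

```python
import numpy as np

def traj_features(traj):
    """Sum per-step observation features over a trajectory.

    Under the (assumed) linear reward r(o) = theta @ o, a trajectory's
    return is theta @ traj_features(traj).
    """
    return traj.sum(axis=0)

def fit_reward(rankings, dim, lr=0.1, steps=500):
    """Fit a linear reward from partial orderings over trajectories.

    rankings: list of chains; each chain is a list of (T, dim) observation
    arrays ordered best-to-worst. A pairwise preference is a chain of
    length 2; a negative demonstration can sit at the bottom of a chain.
    Loss per adjacent pair (i better than j): -log sigmoid(R_i - R_j).
    """
    theta = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for chain in rankings:
            phi = np.array([traj_features(t) for t in chain])  # (n, dim)
            dphi = phi[:-1] - phi[1:]                          # feature gaps
            diffs = dphi @ theta                               # R_i - R_{i+1}
            w = 1.0 / (1.0 + np.exp(diffs))                    # sigmoid(-diff)
            grad -= (w[:, None] * dphi).sum(axis=0)            # dLoss/dtheta
        theta -= lr * grad
    return theta

# Usage on synthetic data: feature 0 of each observation tracks true quality.
rng = np.random.default_rng(0)
good = rng.normal([1.0, 0.0], 0.1, size=(10, 2))
mid = rng.normal([0.5, 0.0], 0.1, size=(10, 2))
bad = rng.normal([0.0, 0.0], 0.1, size=(10, 2))
theta = fit_reward([[good, mid, bad]], dim=2)
R = [traj_features(t) @ theta for t in (good, mid, bad)]
assert R[0] > R[1] > R[2]  # learned returns respect the given ordering
```

Because every feedback type is expressed as the same kind of ordering constraint, preferences, full demonstrations, and negative demonstrations can all be mixed in one `rankings` list, which is the unification the abstract describes.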