Learning from Preferences and Mixed Demonstrations in General Settings

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Designing reward functions for complex sequential tasks is challenging because manually encoding domain knowledge is difficult. Method: The paper proposes reward-rational partial orderings (RRPO), a domain-agnostic reward-learning framing that unifies heterogeneous human feedback (pairwise preferences and partially ordered demonstrations, including negative examples) without requiring domain priors. Building on RRPO, the authors introduce LEOPARD, an algorithm that jointly optimises preference-based and ranked-demonstration objectives. Contribution/Results: Experiments across robotics control, navigation, and text generation show that LEOPARD outperforms existing baselines by a significant margin when only a limited amount of mixed feedback is available. It is robust, generalises across diverse tasks, and is compatible with heterogeneous data sources.

📝 Abstract
Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be used instead. However, existing approaches utilising both together are often ad-hoc, rely on domain-specific properties, or won't scale. We develop a new framing for learning from human data, *reward-rational partial orderings over observations*, designed to be flexible and scalable. Based on this we introduce a practical algorithm, LEOPARD: Learning Estimated Objectives from Preferences And Ranked Demonstrations. LEOPARD can learn from a broad range of data, including negative demonstrations, to efficiently learn reward functions across a wide range of domains. We find that when a limited amount of preference and demonstration feedback is available, LEOPARD outperforms existing baselines by a significant margin. Furthermore, we use LEOPARD to investigate learning from many types of feedback compared to just a single one, and find that combining feedback types is often beneficial.
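The pairwise-preference side of this setup is commonly modelled with a Bradley-Terry likelihood over trajectory returns. Below is a minimal sketch of that standard technique, assuming a linear reward over hand-picked trajectory features; it illustrates the general idea, not LEOPARD itself, and all names are hypothetical:

```python
# Hedged sketch: Bradley-Terry preference-based reward learning.
# Not the paper's implementation; a linear reward model over trajectory
# features, trained on pairwise preferences by plain gradient descent.
import math

def preference_loss(r_preferred, r_other):
    # -log sigmoid(r_preferred - r_other): negative log-probability
    # that the preferred trajectory is ranked above the other one.
    return math.log(1.0 + math.exp(-(r_preferred - r_other)))

def train(prefs, dim, lr=0.1, epochs=200):
    """prefs: list of (phi_preferred, phi_other) feature-vector pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for phi_a, phi_b in prefs:  # phi_a is the preferred trajectory
            ra = sum(wi * x for wi, x in zip(w, phi_a))
            rb = sum(wi * x for wi, x in zip(w, phi_b))
            # Gradient of the loss w.r.t. (ra - rb) is -sigmoid(rb - ra).
            g = -1.0 / (1.0 + math.exp(ra - rb))
            for i in range(dim):
                w[i] -= lr * g * (phi_a[i] - phi_b[i])
    return w
```

Under a partial-ordering framing such as the one described above, a pairwise preference is simply a two-element ordering; richer orderings generalise the same likelihood.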
Problem

Research questions and friction points this paper is trying to address.

Learning reward functions from mixed human feedback types
Overcoming limitations of existing preference and demonstration methods
Developing a scalable algorithm for general reinforcement learning settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward-rational partial orderings framework
LEOPARD algorithm combining preferences and ranked demonstrations
Learning reward functions from mixed feedback
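Ranked demonstrations suggest a listwise likelihood over ordered trajectories. A Plackett-Luce negative log-likelihood is one standard way to score how well a reward model agrees with a ranking; this is a hedged sketch of that generic loss, not necessarily the paper's exact objective:

```python
# Hedged sketch: Plackett-Luce ranking loss for reward learning.
import math

def plackett_luce_nll(rewards):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `rewards` lists trajectory rewards from best to worst; a lower NLL
    means the reward model agrees more with the given ordering.
    """
    nll = 0.0
    for i in range(len(rewards) - 1):
        # Probability that item i beats everything ranked below it.
        denom = sum(math.exp(r) for r in rewards[i:])
        nll -= rewards[i] - math.log(denom)
    return nll
```

Negative demonstrations fit naturally in this view: they are simply placed at the bottom of the ordering.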