Pareto-Optimal Learning from Preferences with Hidden Context

📅 2024-06-21
🏛️ arXiv.org
🤖 AI Summary
This paper addresses the fairness–performance trade-off in AI value alignment when preferences are sourced from multiple groups. It proposes Pareto-Optimal Preference Learning (POPL), a framework that achieves group-fair alignment without requiring explicit group labels. Unlike conventional single-objective approaches, POPL frames discrepant group preferences as objectives with potential trade-offs and combines lexicase selection with preference learning to recover sets of reward functions and policies that are Pareto-optimal on the preference dataset. The method is evaluated on a stateless preference learning setting, the Minigrid RL domain, Metaworld robotics benchmarks, and LLM fine-tuning. Its core contribution is pluralistic alignment under implicit group structure: POPL improves cross-group policy performance and fairness simultaneously, without access to group annotations. Empirical results show that POPL outperforms baseline methods across these benchmarks and can serve as a foundation for techniques that optimize specific notions of group fairness.
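To make the multi-objective framing concrete, the sketch below shows a minimal Pareto-front filter over candidate reward models. It assumes hypothetical per-objective scores (for example, agreement with different subsets of preference comparisons); the helper names and example numbers are illustrative and are not taken from the paper, which notably does not assume group labels are available.

```python
# Minimal sketch (not the authors' code): keep only candidates that are
# Pareto-non-dominated across several preference objectives.
from typing import Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if scores `a` Pareto-dominate scores `b`: at least as good on
    every objective and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores) -> list:
    """Indices of candidates not dominated by any other candidate."""
    return [
        i for i, s_i in enumerate(scores)
        if not any(dominates(s_j, s_i) for j, s_j in enumerate(scores) if j != i)
    ]

# Hypothetical example: each row is one candidate reward model's agreement
# with two different preference objectives; candidates 0 and 2 form the front.
scores = [(0.9, 0.6), (0.7, 0.5), (0.6, 0.9)]
print(pareto_front(scores))  # -> [0, 2]
```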

📝 Abstract
Ensuring AI models align with human values is essential for their safety and functionality. Reinforcement learning from human feedback (RLHF) leverages human preferences to achieve this alignment. However, when preferences are sourced from diverse populations, point estimates of reward can result in suboptimal performance or be unfair to specific groups. We propose Pareto Optimal Preference Learning (POPL), which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs, aiming for policies that are Pareto-optimal on the preference dataset. POPL utilizes lexicase selection, an iterative process that selects diverse and Pareto-optimal solutions. Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies, effectively catering to distinct groups without access to group numbers or membership labels. We verify the performance of POPL on a stateless preference learning setting, a Minigrid RL domain, Metaworld robotics benchmarks, as well as large language model (LLM) fine-tuning. We illustrate that POPL can also serve as a foundation for techniques optimizing specific notions of group fairness, ensuring safe and equitable AI model alignment.
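As a rough illustration of the lexicase selection step mentioned in the abstract (a sketch under my own assumptions, not the authors' implementation), each preference comparison can be treated as a test case, and candidates are filtered case by case in a random order. The `passes` scoring function is a hypothetical stand-in for checking whether a candidate reward function agrees with a given comparison.

```python
import random

def lexicase_select(candidates, cases, passes, rng=random):
    """Pick one candidate via lexicase selection.

    candidates: candidate reward functions or policies (any objects)
    cases:      preference comparisons used as test cases
    passes:     passes(candidate, case) -> float, higher is better
                (e.g. 1.0 if the candidate's reward agrees with the comparison)
    """
    pool = list(candidates)
    order = list(cases)
    rng.shuffle(order)  # fresh random case ordering for each selection event
    for case in order:
        if len(pool) == 1:
            break
        best = max(passes(c, case) for c in pool)
        # keep only candidates that are elite on the current case
        pool = [c for c in pool if passes(c, case) == best]
    return rng.choice(pool)
```

Repeating such selection events with different random case orderings tends to return a diverse set of candidates, and lexicase selection is known to favor individuals on the Pareto boundary with respect to the test cases, which is consistent with the Pareto-optimality goal stated in the abstract.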
Problem

Research questions and friction points this paper is trying to address.

How to align AI models with preferences sourced from diverse populations
Point-estimate reward models can be suboptimal or unfair to specific groups
How to learn Pareto-optimal, equitable policies without group membership labels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pareto Optimal Preference Learning (POPL) framework
Lexicase selection for diverse, Pareto-optimal reward functions and policies
Pluralistic alignment that frames discrepant group preferences as trade-off objectives