AI Summary
To address the high noise, annotation cost, and privacy risks of large-scale human preference labeling in large language model (LLM) alignment, this paper proposes Alignment from Demonstrations (AfD), a novel paradigm that learns from high-quality demonstration data instead of preference data. AfD is the first to introduce inverse reinforcement learning (IRL) into LLM alignment, formally framing demonstration-based learning under unknown reward functions. The paper theoretically characterizes the trade-off between mass-covering and mode-seeking objectives in reward inference, and derives KL-divergence minimization objectives for efficient reward extrapolation and policy distillation, integrating IRL, sequential decision modeling, and reward modeling without requiring any preference labels. On the Harmless and Helpful benchmarks, AfD significantly outperforms supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) baselines, demonstrating strong empirical performance alongside implementation simplicity.
Abstract
Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility. However, existing methods, primarily based on preference datasets, face challenges such as noisy labels, high annotation costs, and privacy concerns. In this work, we introduce Alignment from Demonstrations (AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges. We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals. Drawing insights from forward and inverse reinforcement learning, we introduce divergence minimization objectives for AfD. Analytically, we elucidate the mass-covering and mode-seeking behaviors of various approaches, explaining when and why certain methods are superior. Practically, we propose a computationally efficient algorithm that extrapolates over a tailored reward model for AfD. We validate our key insights through experiments on the Harmless and Helpful tasks, demonstrating their strong empirical performance while maintaining simplicity.
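The mass-covering vs. mode-seeking distinction the abstract refers to is the classic asymmetry between forward and reverse KL divergence minimization. A minimal numerical sketch (illustrative only, not the paper's algorithm; the toy target, grid search, and all names here are assumptions) fits a single Gaussian to a bimodal mixture under each objective and shows the two behaviors:

```python
import numpy as np

# Toy illustration of mass-covering (forward KL) vs mode-seeking (reverse KL).
# Target p: a bimodal Gaussian mixture; model family q: a single Gaussian.
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

p = 0.5 * gauss(x, -3.0, 1.0) + 0.5 * gauss(x, 3.0, 1.0)

def kl(a, b):
    # KL(a || b) on the discretized grid; clip b to avoid log(0).
    b = np.maximum(b, 1e-300)
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

# Brute-force the best single-Gaussian fit under each objective.
mus = np.linspace(-5.0, 5.0, 101)
sigmas = np.linspace(0.5, 5.0, 46)

def best_fit(objective):
    return min(((objective(gauss(x, m, s)), m, s) for m in mus for s in sigmas))[1:]

mu_f, sig_f = best_fit(lambda q: kl(p, q))  # forward KL(p||q): mass-covering
mu_r, sig_r = best_fit(lambda q: kl(q, p))  # reverse KL(q||p): mode-seeking

# Forward KL spreads q across both modes; reverse KL locks onto one mode.
print(f"forward KL(p||q): mu={mu_f:.2f}, sigma={sig_f:.2f}")
print(f"reverse KL(q||p): mu={mu_r:.2f}, sigma={sig_r:.2f}")
```

The forward-KL fit centers between the modes with a wide variance (covering all of p's mass), while the reverse-KL fit collapses onto a single mode; this is the behavioral difference the paper analyzes when choosing objectives for reward inference from demonstrations.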