AI Summary
To address the high noise, annotation cost, and privacy risks of large-scale human preference labeling in large language model (LLM) alignment, this paper proposes Alignment from Demonstrations (AfD), a novel paradigm that learns from high-quality demonstration data instead of preference data. AfD is the first to introduce inverse reinforcement learning (IRL) into LLM alignment, formally framing demonstration-based learning under unknown reward functions. The paper theoretically characterizes the trade-off between mass-covering and mode-seeking objectives in reward inference, and derives KL-divergence minimization objectives for efficient reward extrapolation and policy distillation, integrating IRL, sequential decision modeling, and reward modeling without requiring any preference labels. On the Harmless and Helpful benchmarks, AfD significantly outperforms supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) baselines, demonstrating strong empirical performance alongside implementation simplicity.
Abstract
Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility. However, existing methods, primarily based on preference datasets, face challenges such as noisy labels, high annotation costs, and privacy concerns. In this work, we introduce Alignment from Demonstrations (AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges. We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals. Drawing insights from forward and inverse reinforcement learning, we introduce divergence minimization objectives for AfD. Analytically, we elucidate the mass-covering and mode-seeking behaviors of various approaches, explaining when and why certain methods are superior. Practically, we propose a computationally efficient algorithm that extrapolates over a tailored reward model for AfD. We validate our key insights through experiments on the Harmless and Helpful tasks, demonstrating their strong empirical performance while maintaining simplicity.
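The mass-covering vs. mode-seeking distinction the abstract refers to is the classic asymmetry between forward and reverse KL divergence minimization. A minimal numerical sketch (illustrative only, not the paper's algorithm; the toy target, grid search, and all names here are assumptions) fits a single Gaussian to a bimodal mixture under each objective and shows the two behaviors:

```python
import numpy as np

# Toy illustration of mass-covering (forward KL) vs mode-seeking (reverse KL).
# Target p: a bimodal Gaussian mixture; model family q: a single Gaussian.
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

p = 0.5 * gauss(x, -3.0, 1.0) + 0.5 * gauss(x, 3.0, 1.0)

def kl(a, b):
    # KL(a || b) on the discretized grid; clip b to avoid log(0).
    b = np.maximum(b, 1e-300)
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

# Brute-force the best single-Gaussian fit under each objective.
mus = np.linspace(-5.0, 5.0, 101)
sigmas = np.linspace(0.5, 5.0, 46)

def best_fit(objective):
    return min(((objective(gauss(x, m, s)), m, s) for m in mus for s in sigmas))[1:]

mu_f, sig_f = best_fit(lambda q: kl(p, q))  # forward KL(p||q): mass-covering
mu_r, sig_r = best_fit(lambda q: kl(q, p))  # reverse KL(q||p): mode-seeking

# Forward KL spreads q across both modes; reverse KL locks onto one mode.
print(f"forward KL(p||q): mu={mu_f:.2f}, sigma={sig_f:.2f}")
print(f"reverse KL(q||p): mu={mu_r:.2f}, sigma={sig_r:.2f}")
```

The forward-KL fit centers between the modes with a wide variance (covering all of p's mass), while the reverse-KL fit collapses onto a single mode; this is the behavioral difference the paper analyzes when choosing objectives for reward inference from demonstrations.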