Learning the Preferences of a Learning Agent

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional inverse reinforcement learning (IRL) assumes that observed behavior is nearly optimal, which makes it ill-suited to settings where the agent, human or artificial, is still learning and adapting. This work formalizes the problem of inferring an agent's underlying reward function from the behavior of an online learner, modeling the learner either as a no-regret learner or as one that converges over time to an optimal Boltzmann policy. Combining IRL with online learning theory and Boltzmann policy modeling, it analyzes preference learning algorithms under each learner model, establishing theoretical guarantees in some settings and showing in others that such guarantees are impossible.
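To make the no-regret learner model concrete, the sketch below simulates one well-known no-regret algorithm, Hedge (multiplicative weights), on a small fixed-reward problem, and lets a predictor guess the highest-reward action from play counts alone. This is an illustration under stated assumptions, not the paper's method: the full-information feedback, the step size, the reward values, and the names `hedge_learner`, `average_regret`, and `guess_best_action` are all hypothetical choices made for the example. It shows only the weakest form of inference (identifying the best action), not recovery of the full reward function.

```python
import math
import random


def hedge_learner(rewards, horizon, seed=0):
    """Simulate a Hedge learner that sees every action's reward each round.

    `rewards` are fixed per-action rewards in [0, 1] (an assumption of this
    sketch). Returns the sequence of actions the learner played.
    """
    rng = random.Random(seed)
    k = len(rewards)
    # Standard Hedge step size for rewards in [0, 1] and a known horizon.
    eta = math.sqrt(8 * math.log(k) / horizon)
    cumulative = [0.0] * k
    actions = []
    for _ in range(horizon):
        # Subtract the max before exponentiating for numerical stability.
        top = max(cumulative)
        weights = [math.exp(eta * (c - top)) for c in cumulative]
        actions.append(rng.choices(range(k), weights=weights)[0])
        # Full-information update: every action's reward is observed.
        for a in range(k):
            cumulative[a] += rewards[a]
    return actions


def average_regret(rewards, actions):
    """Average per-round gap between the best fixed action and the play."""
    best = max(rewards)
    return sum(best - rewards[a] for a in actions) / len(actions)


def guess_best_action(actions, k):
    """Predictor: guess that the most frequently played action is optimal."""
    counts = [0] * k
    for a in actions:
        counts[a] += 1
    return counts.index(max(counts))


if __name__ == "__main__":
    true_rewards = [0.2, 0.5, 0.9]  # hidden from the predictor
    played = hedge_learner(true_rewards, horizon=2000)
    print("average regret:", average_regret(true_rewards, played))
    print("predictor's guess:", guess_best_action(played, len(true_rewards)))
```

Because the learner's average regret shrinks as the horizon grows, suboptimal actions make up a vanishing fraction of play, which is why counting plays is enough to identify the best action in this toy setting.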

📝 Abstract

For AI systems to be useful to humans, they must understand and act in accordance with our values and preferences. Since specifying preferences is a hard task, inverse reinforcement learning (IRL) aims to develop methods that allow for inferring preferences from observed behavior. However, IRL assumes the human to be approximately optimal. This is a big limitation in cases where the human themselves may be learning to act optimally in an environment. In this paper, we formalize the problem of learning the preferences of a learning agent: a predictor observes a learner acting online and tries to infer the underlying reward function being (initially suboptimally) optimized by the learner. We model the learner as either being no-regret, or as converging to an optimal Boltzmann policy over time. In each of these settings, we establish theoretical guarantees for various preference learning algorithms, or otherwise show that such guarantees are impossible.
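The second learner model, convergence to an optimal Boltzmann policy, can be sketched in the same spirit. In the example below the learner samples actions from a softmax over the true rewards with an inverse temperature that grows linearly over time, and the predictor ranks actions by how often they were played. The linear temperature schedule, the value of `beta_max`, the reward values, and the function names are assumptions made for illustration; the paper's own algorithms and guarantees are not reproduced here.

```python
import math
import random


def boltzmann_policy(rewards, beta):
    """Softmax over rewards with inverse temperature `beta`.

    beta = 0 gives the uniform policy; larger beta concentrates
    probability on higher-reward actions.
    """
    top = max(rewards)
    weights = [math.exp(beta * (r - top)) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_boltzmann_learner(rewards, horizon, beta_max=5.0, seed=0):
    """Learner whose inverse temperature rises linearly to `beta_max`.

    The schedule is a hypothetical stand-in for 'initially suboptimal,
    improving over time'. Returns the actions played.
    """
    rng = random.Random(seed)
    k = len(rewards)
    actions = []
    for t in range(horizon):
        beta = beta_max * (t + 1) / horizon
        probs = boltzmann_policy(rewards, beta)
        actions.append(rng.choices(range(k), weights=probs)[0])
    return actions


def rank_actions_by_frequency(actions, k):
    """Predictor: order actions from most to least frequently played."""
    counts = [0] * k
    for a in actions:
        counts[a] += 1
    return sorted(range(k), key=lambda a: counts[a], reverse=True)


if __name__ == "__main__":
    true_rewards = [0.1, 0.5, 0.9]  # hidden from the predictor
    played = simulate_boltzmann_learner(true_rewards, horizon=5000)
    print("estimated ranking (best first):",
          rank_actions_by_frequency(played, len(true_rewards)))
```

A Boltzmann policy assigns higher probability to higher-reward actions at every positive temperature, so play frequencies carry information about the ordering of rewards even while the learner is still far from optimal. That is the property this sketch relies on.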

Problem

Research questions and friction points this paper is trying to address.

preference learning

inverse reinforcement learning

learning agent

reward inference

suboptimal behavior

Innovation

Methods, ideas, or system contributions that make the work stand out.

preference learning

inverse reinforcement learning

learning agent