Generalizing Behavior via Inverse Reinforcement Learning with Closed-Form Reward Centroids

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of generalizing expert demonstrations to new environments or additional constraints. Inverse reinforcement learning (IRL) seeks to recover the expert's underlying reward function so that planning with it reproduces the desired behavior, but IRL is inherently ill-posed: infinitely many reward functions can rationalize the same observed behavior. To resolve this ambiguity, the paper proposes a principled criterion that selects the "average" policy among those induced by a bounded subset of the feasible reward set, and shows that this policy can be obtained by planning with the centroid of that subset, for which a closed-form expression is derived. A provably efficient algorithm estimates this centroid from an offline dataset of expert demonstrations alone, and numerical simulations illustrate how the behavior produced by the method relates to the expert's.

📝 Abstract
We study the problem of generalizing an expert agent's behavior, provided through demonstrations, to new environments and/or additional constraints. Inverse Reinforcement Learning (IRL) offers a promising solution by seeking to recover the expert's underlying reward function, which, if used for planning in the new settings, would reproduce the desired behavior. However, IRL is inherently ill-posed: multiple reward functions, forming the so-called feasible set, can explain the same observed behavior. Since these rewards may induce different policies in the new setting, in the absence of additional information, a decision criterion is needed to select which policy to deploy. In this paper, we propose a novel, principled criterion that selects the "average" policy among those induced by the rewards in a certain bounded subset of the feasible set. Remarkably, we show that this policy can be obtained by planning with the reward centroid of that subset, for which we derive a closed-form expression. We then present a provably efficient algorithm for estimating this centroid using an offline dataset of expert demonstrations only. Finally, we conduct numerical simulations that illustrate the relationship between the expert's behavior and the behavior produced by our method.
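The abstract's key idea is that the feasible reward set, while containing infinitely many rewards, admits a distinguished "centroid" whose induced policy serves as a principled default. The paper derives this centroid in closed form; that expression is not reproduced here. As a rough conceptual sketch only, the toy example below approximates a centroid by rejection sampling in a small tabular MDP, exploiting the standard fact that the set of rewards rationalizing a fixed policy is convex. The MDP, function names, and Monte Carlo estimator are illustrative assumptions of ours, not the paper's method:

```python
import numpy as np

# Illustrative sketch only: approximates the centroid of the (convex) set of
# rewards under which the expert policy is optimal, via rejection sampling.
# This is NOT the paper's closed-form solution.

GAMMA, TOL, VI_ITERS = 0.9, 1e-3, 120

def q_values(P, R, gamma=GAMMA, iters=VI_ITERS):
    """Approximate optimal Q-values by value iteration. P: (S,A,S), R: (S,A)."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * P.reshape(S * A, S).dot(V).reshape(S, A)
    return Q

def rationalizes(P, R, expert):
    """True if the expert's action is (near-)optimal in every state under R."""
    Q = q_values(P, R)
    return all(Q[s, a] >= Q[s].max() - TOL for s, a in enumerate(expert))

rng = np.random.default_rng(0)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
R_true = rng.uniform(-1, 1, size=(S, A))     # hidden "true" reward
expert = q_values(P, R_true).argmax(axis=1)  # expert = optimal policy for it

# Monte Carlo estimate of the centroid of the bounded feasible subset
# {R in [-1, 1]^{S*A} : expert is optimal under R}.
samples = []
while len(samples) < 100:
    R = rng.uniform(-1, 1, size=(S, A))
    if rationalizes(P, R, expert):
        samples.append(R)
R_centroid = np.mean(samples, axis=0)

# Because the feasible set is convex, the centroid itself rationalizes the
# expert: planning with R_centroid reproduces the demonstrated behavior.
print(rationalizes(P, R_centroid, expert))
```

The point of the paper is precisely to avoid this kind of sampling: its closed-form centroid and offline estimator sidestep both the rejection loop and the need to query the environment.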
Problem

Research questions and friction points this paper is trying to address.

Generalizing expert behavior to new environments or constraint settings
Choosing which policy to deploy when multiple reward functions explain the same demonstrations
Making planning with the recovered reward tractable and principled
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-form expression for the centroid of a bounded subset of the feasible reward set
Provably efficient centroid estimation from offline expert demonstrations only
"Average" policy selection criterion over the feasible rewards