TROFI: Trajectory-Ranked Offline Inverse Reinforcement Learning

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of reward function unavailability in offline reinforcement learning (RL) by proposing TROFI, a preference-based policy learning framework that operates without predefined reward functions. Methodologically, TROFI integrates offline inverse RL with trajectory preference modeling to infer an implicit reward function directly from suboptimal and imperfect human preference data; this inferred reward is then used to label offline datasets for policy training—eliminating reliance on expert demonstrations or ground-truth reward labels. The approach operates under weak supervision and achieves performance on the D4RL benchmark comparable to reward-supervised baselines, significantly outperforming existing reward-free methods. Empirical validation in 3D game environments further demonstrates its practical deployability. Overall, TROFI extends the applicability of offline RL to reward-agnostic domains—including game development and human–AI interaction—where explicit reward engineering is infeasible or undesirable.
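The core recipe described here is a standard preference-based reward learning loop: a small reward network scores pairs of trajectory segments and is fit with a Bradley-Terry objective so that the preferred segment accumulates more predicted reward. The sketch below illustrates only that step; the network architecture, tensor shapes, and loss wiring are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of preference-based reward learning (Bradley-Terry style).
# NOT the paper's code: network size, data shapes, and naming are assumptions.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_net, seg_a, seg_b, prefer_a):
    """Bradley-Terry loss: the preferred segment should score higher in total.

    seg_a, seg_b: dicts with 'obs' and 'act' tensors of shape (batch, T, dim).
    prefer_a:     float tensor of shape (batch,), 1.0 if segment A is preferred.
    """
    ret_a = reward_net(seg_a["obs"], seg_a["act"]).sum(dim=1)  # (batch,)
    ret_b = reward_net(seg_b["obs"], seg_b["act"]).sum(dim=1)
    logits = ret_a - ret_b
    return nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)
```

Once trained, such a reward model stands in for the missing reward function when relabeling the offline dataset, as sketched after the abstract below.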

📝 Abstract
In offline reinforcement learning, agents are trained using only a fixed set of stored transitions derived from a source policy. However, this requires that the dataset be labeled by a reward function. In applied settings such as video game development, the availability of the reward function is not always guaranteed. This paper proposes Trajectory-Ranked OFfline Inverse reinforcement learning (TROFI), a novel approach to effectively learn a policy offline without a pre-defined reward function. TROFI first learns a reward function from human preferences, which it then uses to label the original dataset, making it usable for training the policy. In contrast to other approaches, our method does not require optimal trajectories. Through experiments on the D4RL benchmark, we demonstrate that TROFI consistently outperforms baselines and performs comparably to using the ground truth reward to learn policies. Additionally, we validate the efficacy of our method in a 3D game environment. Our studies of the reward model highlight the importance of the reward function in this setting: we show that to ensure the alignment of a value function to the actual future discounted reward, it is fundamental to have a well-engineered and easy-to-learn reward function.
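As a companion to the preference-learning sketch above, the snippet below illustrates the two follow-on steps the abstract describes: relabeling the offline dataset with the learned reward so a standard offline RL algorithm can train on it, and computing Monte-Carlo discounted returns as a reference when checking how well a learned value function aligns with the actual future discounted reward. Field names follow D4RL conventions; the batching, helper names, and discount factor are assumptions, not the paper's code.

```python
# Sketch of the relabeling step: replace dataset rewards with the learned
# reward model's predictions (hypothetical helper names, D4RL-style fields).
import numpy as np
import torch

@torch.no_grad()
def relabel_dataset(dataset: dict, reward_net, batch_size: int = 4096) -> dict:
    """Overwrite 'rewards' with predictions from a trained reward model."""
    obs = torch.as_tensor(dataset["observations"], dtype=torch.float32)
    act = torch.as_tensor(dataset["actions"], dtype=torch.float32)
    rewards = []
    for i in range(0, len(obs), batch_size):
        rewards.append(reward_net(obs[i:i + batch_size], act[i:i + batch_size]))
    relabeled = dict(dataset)
    relabeled["rewards"] = torch.cat(rewards).cpu().numpy()
    return relabeled

def discounted_returns(rewards: np.ndarray, terminals: np.ndarray, gamma: float = 0.99):
    """Monte-Carlo discounted returns, a reference for judging how closely a
    learned value function tracks the actual future discounted reward."""
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running * (1.0 - terminals[t])
        returns[t] = running
    return returns
```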
Problem

Research questions and friction points this paper is trying to address.

Learn policy offline without predefined reward function
Derive reward function from human preferences
Ensure alignment of value function to future rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns reward function from human preferences
Labels dataset without predefined rewards
Performs well without optimal trajectories
👥 Authors
Alessandro Sestini
SEED - Electronic Arts (EA), Stockholm, Sweden
Joakim Bergdahl
SEED - Electronic Arts (EA), Stockholm, Sweden
Konrad Tollmar
SEED - Electronic Arts (EA), Stockholm, Sweden
Andrew D. Bagdanov
Associate Professor, University of Florence, Italy
Computer vision, deep learning, artificial intelligence, deep reinforcement learning, image processing
Linus Gisslén
SEED - Electronic Arts (EA), Stockholm, Sweden