🤖 AI Summary
This work addresses a core challenge in reinforcement learning: handcrafted reward functions are time-consuming to design and often misaligned with the true task objective. To this end, the paper leverages the Trajectory Alignment Coefficient (TAC), a metric quantifying how consistently a reward function's induced preferences match those of a domain expert, and employs TAC both as an auxiliary signal during manual reward tuning and as a direct learning objective for reward modeling. Furthermore, the authors develop Soft-TAC, a differentiable approximation of TAC, enabling end-to-end learning of reward models from human preference data. Experiments demonstrate that in Lunar Lander, surfacing TAC during tuning helped practitioners produce more performant reward functions while reducing cognitive workload; in Gran Turismo 7, reward models trained with Soft-TAC captured preference-specific objectives, yielding policies with qualitatively more distinct behaviors than standard cross-entropy training.
📝 Abstract
The success of reinforcement learning (RL) is fundamentally tied to having a reward function that accurately reflects the task objective. Yet designing reward functions is notoriously time-consuming and prone to misspecification. To address this issue, our first goal is to understand how to support RL practitioners in specifying appropriate weights for a reward function. We leverage the Trajectory Alignment Coefficient (TAC), a metric that evaluates how closely a reward function's induced preferences match those of a domain expert. To evaluate whether TAC provides effective support in practice, we conducted a human-subject study in which RL practitioners tuned reward weights for Lunar Lander. We found that providing TAC during reward tuning led participants to produce more performant reward functions and report lower cognitive workload relative to standard tuning without TAC. However, the study also underscored that manual reward design, even with TAC, remains labor-intensive. This limitation motivated our second goal: to learn a reward model that maximizes TAC directly. Specifically, we propose Soft-TAC, a differentiable approximation of TAC that can be used as a loss function to train reward models from human preference data. Validated in the racing simulator Gran Turismo 7, reward models trained using Soft-TAC successfully captured preference-specific objectives, resulting in policies with qualitatively more distinct behaviors than models trained with standard cross-entropy loss. This work demonstrates that TAC can serve as both a practical tool for guiding reward tuning and a reward learning objective in complex domains.
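To make the TAC/Soft-TAC distinction concrete, here is a minimal sketch. The paper's exact formulas are not given in this abstract; the sketch assumes TAC is a Kendall-tau-style agreement score between the sign of the learned reward's return gap on each preference pair and the expert's label, and that Soft-TAC replaces the non-differentiable sign with a temperature-scaled `tanh` so it can serve as a training loss. All function names and the temperature parameter are hypothetical illustrations, not the authors' implementation.

```python
import math

def tac(returns_a, returns_b, prefs):
    """Hard (non-differentiable) alignment score in [-1, 1].

    returns_a, returns_b: per-pair trajectory returns under the
    candidate reward function.
    prefs: +1 if the expert preferred trajectory A, -1 if B.
    Assumed Kendall-tau-style form; not the paper's exact definition.
    """
    signs = [math.copysign(1.0, ra - rb) if ra != rb else 0.0
             for ra, rb in zip(returns_a, returns_b)]
    return sum(p * s for p, s in zip(prefs, signs)) / len(prefs)

def soft_tac_loss(returns_a, returns_b, prefs, temperature=1.0):
    """Differentiable surrogate: tanh of the scaled return gap
    replaces sign, so gradients can flow into a reward model.
    Negated so that maximizing alignment minimizes the loss."""
    soft = [math.tanh((ra - rb) / temperature)
            for ra, rb in zip(returns_a, returns_b)]
    return -sum(p * s for p, s in zip(prefs, soft)) / len(prefs)
```

As `temperature` shrinks, `soft_tac_loss` approaches the negated hard TAC; a larger temperature smooths the objective, trading fidelity for easier optimization.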