🤖 AI Summary
Sparse and delayed rewards are a significant obstacle to deploying reinforcement learning (RL) in practice. To address this, we propose Attention-based REward Shaping (ARES), which uses a transformer's attention mechanism to learn dense, shaped rewards from a set of episodes and their final returns. ARES trains entirely offline and produces meaningful shaped rewards even from small datasets or from episodes generated by agents taking random actions. It is compatible with any RL algorithm, handles any level of reward sparsity, and is not limited to goal-based tasks. Experiments focus on the hardest case, in which rewards are fully delayed until the end of each episode, and cover a range of environments, RL algorithms, and baselines. The results show that ARES significantly improves learning in delayed-reward settings, letting agents train in scenarios that would otherwise require impractical amounts of data or be unlearnable. To our knowledge, ARES is the first approach that combines fully offline training, robustness to extreme reward delays and low-quality data, and applicability beyond goal-based tasks.
📝 Abstract
Sparse and delayed reward functions pose a significant obstacle for real-world Reinforcement Learning (RL) applications. In this work, we propose Attention-based REward Shaping (ARES), a general and robust algorithm which uses a transformer's attention mechanism to generate shaped rewards and create a dense reward function for any environment. ARES requires a set of episodes and their final returns as input. It can be trained entirely offline and is able to generate meaningful shaped rewards even when using small datasets or episodes produced by agents taking random actions. ARES is compatible with any RL algorithm and can handle any level of reward sparsity. In our experiments, we focus on the most challenging case where rewards are fully delayed until the end of each episode. We evaluate ARES across a diverse range of environments, widely used RL algorithms, and baseline methods to assess the effectiveness of the shaped rewards it produces. Our results show that ARES can significantly improve learning in delayed reward settings, enabling RL agents to train in scenarios that would otherwise require impractical amounts of data or even be unlearnable. To our knowledge, ARES is the first approach that works fully offline, remains robust to extreme reward delays and low-quality data, and is not limited to goal-based tasks.
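To make the fully delayed setting concrete, here is a minimal Python sketch. The first function collapses a per-step reward sequence into a single reward at the end of the episode, which is the setting the experiments focus on. The second function is purely hypothetical: it spreads an episode's final return over timesteps in proportion to a softmax of per-step scores. The abstract does not say how ARES converts attention into rewards, so the function names, the softmax rule, and the hand-picked scores are all assumptions made for illustration, not the authors' method.

```python
import math
from typing import List, Sequence


def delay_rewards(rewards: Sequence[float]) -> List[float]:
    """Fully delayed setting: zero reward at every step except the last,
    which carries the whole episode return."""
    if not rewards:
        return []
    return [0.0] * (len(rewards) - 1) + [float(sum(rewards))]


def redistribute_return(scores: Sequence[float], episode_return: float) -> List[float]:
    """Hypothetical shaping rule (an assumption, not the ARES algorithm):
    split the episode return across steps by softmax(scores)."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [episode_return * e / total for e in exps]


# Original per-step rewards of one episode.
dense = [0.0, 1.0, 0.0, 2.0]

# What the agent actually observes when rewards are fully delayed.
delayed = delay_rewards(dense)

# Hand-picked stand-ins for per-step scores that a learned model might produce.
scores = [0.1, 2.0, 0.1, 3.0]
shaped = redistribute_return(scores, delayed[-1])

print(delayed)
print([round(r, 3) for r in shaped])
```

Because the softmax weights sum to one, this particular rule keeps the episode return unchanged while giving the agent a nonzero signal at every step. Whether ARES preserves the return in this way is not stated in the abstract.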