🤖 AI Summary
In sparse-reward reinforcement learning, per-step reward assignment is often inaccurate, hindering efficient policy optimization.
Method: This paper proposes a likelihood-based reward redistribution framework that formulates reward redistribution as a leave-one-out (LOO) likelihood maximization problem over a state-action-dependent parametric probability distribution, whose surrogate objective inherently contains an uncertainty regularization term. We theoretically show that conventional mean-squared error (MSE)-based methods arise as a special case of our framework under a fixed-variance Gaussian assumption. The method is designed for seamless integration with off-policy algorithms such as Soft Actor-Critic (SAC).
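To make the Gaussian special case concrete, here is a minimal NumPy sketch of a leave-one-out Gaussian negative log-likelihood for reward redistribution. The function name `lrr_loss` and the exact construction of the LOO target are illustrative assumptions, not the paper's released code; the point is that with the log-variance fixed at zero, the objective reduces to a (scaled) MSE against the LOO targets.

```python
import numpy as np

def lrr_loss(mu, log_var, episodic_return):
    """Leave-one-out Gaussian NLL for reward redistribution (illustrative sketch).

    mu       : (T,) predicted per-step reward means, one per state-action pair
    log_var  : (T,) predicted per-step log-variances (the uncertainty)
    episodic_return : scalar observed return R for the whole trajectory
    """
    # LOO target for step t: R minus the predicted rewards of all other steps
    loo_target = episodic_return - (mu.sum() - mu)          # shape (T,)
    var = np.exp(log_var)
    # Gaussian NLL per step: squared error weighted by predicted uncertainty,
    # plus a log-variance term that regularizes the uncertainty itself
    nll = 0.5 * ((loo_target - mu) ** 2 / var + log_var)
    return nll.mean()
```

Setting `log_var = 0` (unit variance) recovers `0.5 *` the MSE between `mu` and the LOO targets, matching the claim that MSE-based redistribution is the fixed-uncertainty Gaussian special case.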
Results: Evaluated on Box2D and MuJoCo benchmarks, our approach significantly improves sample efficiency and final policy performance. Empirical results demonstrate its effectiveness in generating dense, information-rich rewards and confirm strong generalization across diverse continuous-control tasks.
📝 Abstract
In many practical reinforcement learning scenarios, feedback is provided only at the end of a long horizon, leading to sparse and delayed rewards. Existing reward redistribution methods typically assume that per-step rewards are independent, thus overlooking interdependencies among state-action pairs. In this paper, we propose a *Likelihood Reward Redistribution* (LRR) framework that addresses this issue by modeling each per-step reward with a parametric probability distribution whose parameters depend on the state-action pair. By maximizing the likelihood of the observed episodic return via a leave-one-out (LOO) strategy that leverages the entire trajectory, our framework inherently introduces an uncertainty regularization term into the surrogate objective. Moreover, we show that the conventional mean squared error (MSE) loss for reward redistribution emerges as a special case of our likelihood framework when the uncertainty is fixed under the Gaussian distribution. When integrated with an off-policy algorithm such as Soft Actor-Critic, LRR yields dense and informative reward signals, resulting in superior sample efficiency and policy performance on Box2D and MuJoCo benchmarks.