🤖 AI Summary
This work addresses a critical limitation of existing diffusion-based reinforcement learning methods: they typically neglect explicit modeling of the critic. Accurate value estimation, particularly in stochastic environments, calls for a distributional representation of the value function. To this end, we propose the Diffusion Bridge Critic (DBC), the first approach to integrate diffusion bridge models into reinforcement learning critics. DBC directly models the inverse cumulative distribution function of the Q-value, leveraging the strong distribution-matching capability of diffusion bridges to capture the value distribution precisely. We also derive an analytic integration formula that eliminates the discretization errors inherent in prior methods. DBC avoids degenerating into a trivial Gaussian distribution and is plug-and-play compatible with mainstream reinforcement learning frameworks. Experiments on MuJoCo continuous control tasks show that DBC significantly outperforms existing distributional critic methods, validating its superiority in both value estimation accuracy and policy performance.
📝 Abstract
Recent advances in diffusion-based reinforcement learning (RL) have demonstrated promising results on a wide range of continuous control tasks. However, existing work in this area focuses on diffusion policies while leaving diffusion critics unexplored. In fact, since policy optimization fundamentally relies on the critic, accurate value estimation is far more important than policy expressiveness. Furthermore, given the stochasticity of most RL tasks, prior work has shown that the critic is more appropriately depicted by a distributional model. Motivated by these observations, we propose a novel distributional RL method with Diffusion Bridge Critics (DBC). DBC directly models the inverse cumulative distribution function (CDF) of the Q-value; owing to the strong distribution-matching capability of the diffusion bridge, this accurately captures the value distribution and prevents it from collapsing into a trivial Gaussian. We further derive an analytic integral formula that removes the discretization errors in DBC, which is essential for accurate value estimation. To our knowledge, DBC is the first work to employ a diffusion bridge model as the critic. Notably, DBC is also a plug-and-play component that can be integrated into most existing RL frameworks. Experimental results on MuJoCo robot control benchmarks demonstrate the superiority of DBC over previous distributional critic models.
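To make the core quantity concrete: modeling the inverse CDF (quantile function) of the Q-value means the scalar critic target is the integral of that function over quantile levels, E[Q] = ∫₀¹ F⁻¹(τ) dτ. The sketch below is a minimal, hypothetical illustration of that relationship only: it stands in a closed-form quantile function (of an exponential distribution) for the learned diffusion bridge model, and uses simple midpoint quadrature rather than the paper's analytic integration formula, which is not reproduced here.

```python
import numpy as np

def quantile_critic(tau, rate=2.0):
    """Toy stand-in for a learned inverse-CDF (quantile) critic.

    In DBC this mapping tau -> Q_tau would be produced by a diffusion
    bridge model conditioned on (state, action); here we use the
    closed-form quantile function of Exp(rate) purely for illustration:
    F^{-1}(tau) = -ln(1 - tau) / rate.
    """
    return -np.log1p(-tau) / rate

def expected_q(quantile_fn, n=10001):
    """Estimate E[Q] = integral of F^{-1}(tau) over tau in (0, 1).

    Midpoint quadrature over quantile levels; the midpoints avoid the
    singularity at tau = 1. This shows the quantity being computed,
    not the paper's discretization-free analytic formula.
    """
    tau = (np.arange(n) + 0.5) / n  # midpoints of n equal sub-intervals
    return quantile_fn(tau).mean()

q_mean = expected_q(quantile_critic)
# The true mean of Exp(2.0) is 0.5; the quadrature estimate is close.
```

Shrinking `n` makes the discretization error of this numerical approximation visible, which is exactly the error source the analytic integral formula in the paper is designed to eliminate.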