🤖 AI Summary
Existing distributional reinforcement learning methods either rely on projections (finite-support and quantile-based methods) or, in recent flow-based approaches, can suffer from boundary mismatch at the flow source or from high-variance bootstrapping when current and successor noises are independent. This work proposes Path-Coupled Bellman Flows (PCBF), which uses source-consistent Bellman-coupled paths that do not require the intermediate-time marginals to satisfy a distributional Bellman fixed point. By coupling the current and successor return flows through shared base noise and using a λ-parameterized control-variate target, PCBF offers a tunable bias–variance trade-off: λ = 0 recovers an unbiased sample Bellman target, while λ > 0 accepts controlled bias in exchange for lower variance. The method combines continuous-time flow matching with path coupling and reports improved distributional fidelity and training stability on analytically tractable MRPs, OGBench, and D4RL, along with competitive offline reinforcement learning performance.
📝 Abstract
Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from \emph{boundary mismatch} at the flow source or from \emph{high-variance} bootstrapping when current and successor noises are independent. We propose Path-Coupled Bellman Flows (PCBF), a continuous-time DRL method that learns return distributions with flow matching using \textbf{source-consistent Bellman-coupled paths}: the current path starts from the required base prior at $t{=}0$, reaches the Bellman target at $t{=}1$, and maintains a pathwise affine relation to the successor flow at intermediate times (without requiring time-$t$ marginals to satisfy a distributional Bellman fixed point for all $t$). PCBF couples current and successor return flows through shared base noise and uses a $\lambda$-parameterized control-variate target: $\lambda{=}0$ recovers an unbiased sample Bellman target, while $\lambda{>}0$ trades controlled bias for variance reduction. Experiments on analytically tractable MRPs, OGBench, and D4RL show improved distributional fidelity and training stability, and competitive offline RL performance.
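To make the effect of shared base noise concrete, the following is a minimal toy sketch, not the PCBF implementation. It assumes a one-step setting with a Gaussian successor return, a hypothetical closed-form successor flow map $z' = \mu' + \sigma' \epsilon'$, a straight-line path $x_t = (1-t)\,\epsilon + t\,z$ from base noise $\epsilon$ to the sample Bellman target $z = r + \gamma z'$, and the corresponding velocity target $z - \epsilon$. All names and parameter values are illustrative assumptions, and the $\lambda$-parameterized control-variate target is not modeled here. The sketch only compares the variance of the velocity target when $\epsilon'$ is drawn independently versus when it is shared with $\epsilon$.

```python
import random
import statistics


def velocity_targets(n, shared_noise, r=1.0, gamma=0.9,
                     mu_next=2.0, sigma_next=1.0, seed=0):
    """Sample straight-line flow-matching targets for a toy Bellman backup.

    Hypothetical setup (not from the paper): the successor return is
    N(mu_next, sigma_next^2), produced by the map eps -> mu_next + sigma_next * eps.
    """
    rng = random.Random(seed)
    targets, endpoints = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)  # base noise, i.e. x_0 of the current path
        # Shared coupling reuses eps for the successor; otherwise draw fresh noise.
        eps_next = eps if shared_noise else rng.gauss(0.0, 1.0)
        z_next = mu_next + sigma_next * eps_next  # successor return sample
        z = r + gamma * z_next                    # sample Bellman target, x_1
        targets.append(z - eps)                   # velocity of x_t = (1-t)*eps + t*z
        endpoints.append(z)
    return targets, endpoints


indep_u, indep_z = velocity_targets(20000, shared_noise=False)
shared_u, shared_z = velocity_targets(20000, shared_noise=True)

var_indep = statistics.pvariance(indep_u)
var_shared = statistics.pvariance(shared_u)
mean_indep_z = statistics.fmean(indep_z)
mean_shared_z = statistics.fmean(shared_z)

print(f"velocity-target variance, independent noise: {var_indep:.3f}")
print(f"velocity-target variance, shared noise:      {var_shared:.3f}")
print(f"endpoint means (independent, shared): {mean_indep_z:.3f}, {mean_shared_z:.3f}")
```

In this toy, both couplings start at the same base prior and end at the same Bellman target distribution, so only the pairing of endpoints differs. Analytically, the velocity-target variance is $1 + \gamma^2 \sigma'^2 = 1.81$ with independent noise and $(\gamma \sigma' - 1)^2 = 0.01$ with shared noise, which is the kind of variance reduction that motivates coupling the two flows.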