Flow-based Policy With Distributional Reinforcement Learning in Trajectory Optimization

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional reinforcement learning commonly employs diagonal Gaussian policies, which struggle to capture multimodal optimal behaviors and optimize only the mean of the return distribution, neglecting its full structure and limiting policy performance. This work introduces flow matching into policy modeling, integrating it with distributional reinforcement learning to construct a policy representation capable of accurately fitting complex, multimodal return distributions. By directly optimizing the entire return distribution to guide policy updates, the proposed method achieves significant performance gains over existing algorithms on MuJoCo continuous control benchmarks, demonstrating state-of-the-art results as well as enhanced expressiveness in representing policy-induced return distributions.
📝 Abstract
Reinforcement Learning (RL) has proven highly effective in addressing complex control and decision-making tasks. However, in most traditional RL algorithms the policy is parameterized as a diagonal Gaussian distribution, which prevents it from capturing multimodal action distributions and thus from covering the full range of optimal solutions in multi-solution problems; moreover, the return is reduced to a mean value, losing its multimodal structure and providing insufficient guidance for policy updates. To address these problems, we propose an RL algorithm termed flow-based policy with distributional RL (FP-DRL). The algorithm models the policy with flow matching, which offers both computational efficiency and the capacity to fit complex distributions. Additionally, it employs a distributional RL approach to model and optimize the entire return distribution, thereby more effectively guiding multimodal policy updates and improving agent performance. Experimental trials on MuJoCo benchmarks demonstrate that FP-DRL achieves state-of-the-art (SOTA) performance on most MuJoCo control tasks while exhibiting the superior representational capability of the flow policy.
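The abstract describes modeling the policy with flow matching so it can represent multimodal action distributions. The paper's exact formulation is not given here, so the following is a rough sketch of the standard conditional flow matching setup such a policy typically uses; the names `cfm_training_pair`, `v_theta`, and `sample_action` are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_pair(a1, rng):
    """Build conditional flow matching regression data: interpolate between
    noise a0 (t=0) and a data action a1 (t=1); the target is the path
    velocity a1 - a0, which a network v_theta(a_t, state, t) regresses onto."""
    a0 = rng.standard_normal(a1.shape)      # base noise sample
    t = rng.uniform(size=(a1.shape[0], 1))  # per-sample time in [0, 1]
    a_t = (1.0 - t) * a0 + t * a1           # straight-line interpolant
    return a_t, t, a1 - a0                  # network inputs and velocity target

def sample_action(v_theta, state, dim, steps=8, rng=rng):
    """Draw an action by Euler-integrating the learned velocity field
    from noise at t=0 to the action distribution at t=1."""
    a = rng.standard_normal((1, dim))
    for k in range(steps):
        t = np.full((1, 1), k / steps)
        a = a + v_theta(a, state, t) / steps
    return a

# Toy bimodal "optimal action" data: two modes a diagonal Gaussian
# policy cannot cover, but a flow can.
a1 = np.where(rng.uniform(size=(256, 1)) < 0.5, -2.0, 2.0)
a_t, t, v_target = cfm_training_pair(a1, rng)
```

In a full implementation, `v_theta` would be a neural network trained to minimize the squared error against `v_target`, conditioned on the state; sampling then requires only a handful of Euler steps, which is the computational-efficiency advantage the abstract attributes to flow matching over diffusion-style policies.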
Problem

Research questions and friction points this paper is trying to address.

multimodal policy
diagonal Gaussian distribution
distributional reinforcement learning
return distribution
trajectory optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

flow matching
distributional reinforcement learning
multimodal policy
trajectory optimization
return distribution
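The distributional component listed above models the full return distribution rather than its mean. The paper's critic parameterization is not specified here; a common choice in distributional RL is quantile regression with a quantile Huber loss, sketched below as an assumption (the function `quantile_huber_loss` and its arguments are illustrative, not from the paper).

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile regression loss for a distributional critic: fit N quantile
    estimates of the return distribution instead of a single mean value.
    pred_quantiles: (batch, N) predicted return quantiles.
    target_samples: (batch, M) samples from the TD target distribution."""
    n = pred_quantiles.shape[-1]
    taus = (np.arange(n) + 0.5) / n  # fixed quantile midpoints
    # pairwise residuals: each target sample minus each predicted quantile
    u = target_samples[:, None, :] - pred_quantiles[:, :, None]
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    # asymmetric weight pushes each estimate toward its own quantile level
    weight = np.abs(taus[None, :, None] - (u < 0.0).astype(float))
    return np.mean(np.sum(np.mean(weight * huber, axis=2), axis=1))
```

Because the critic retains the whole (possibly multimodal) return distribution, it can supply richer gradients to a multimodal flow policy than a mean-value critic would.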
Ruijie Hao
College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
Longfei Zhang
College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
Yang Dai
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Yang Ma
Aviation University of Air Force, Changchun 130000, China
Xingxing Liang
College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
Guangquan Cheng
College of Systems Engineering, National University of Defense Technology, Changsha 410073, China