🤖 AI Summary
This work addresses the challenge of unreliable policy divergence estimation in directed acyclic graphs, which leads to unstable and inflexible training in GFlowNets. To overcome this, the authors propose a novel flow-matching objective based on partial trajectories, leveraging the flow balance condition (used here for the first time as a policy evaluator) to unify the value-function-estimation and policy-optimization perspectives. The resulting framework naturally supports parameterized backward policies and is compatible with offline data, significantly enhancing training stability and adaptability. Empirical evaluations on both synthetic and real-world tasks demonstrate the method's effectiveness, yielding more reliable policy learning and improved data efficiency.
📝 Abstract
Generative Flow Networks (GFlowNets) were developed to learn policies for efficiently sampling combinatorial candidates by interpreting their generative processes as trajectories in directed acyclic graphs. In the value-based training workflow, the objective is to enforce the balance over partial episodes between the flows of the learned policy and the estimated flows of the desired policy, implicitly encouraging policy divergence minimization. The policy-based strategy alternates between estimating the policy divergence and updating the policy, but reliable estimation of the divergence under directed acyclic graphs remains a major challenge. This work bridges the two perspectives by showing that flow balance also yields a principled policy evaluator that measures the divergence, and an evaluation balance objective over partial episodes is proposed for learning the evaluator. As demonstrated on both synthetic and real-world tasks, evaluation balance not only strengthens the reliability of policy-based training but also broadens its flexibility by seamlessly supporting parameterized backward policies and enabling the integration of offline data-collection techniques.
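To make the flow balance idea concrete, here is a minimal sketch of the standard GFlowNet detailed-balance identity, F(s)·P_F(s'|s) = F(s')·P_B(s|s'), whose squared log-space residual is the usual per-transition training signal. The function name and toy numbers are illustrative assumptions; the paper's evaluation-balance objective operates over partial episodes and is not reproduced here.

```python
import math

def detailed_balance_residual(log_F_s, log_PF, log_F_sp, log_PB):
    """Squared log-space residual of the detailed balance condition
    F(s) * P_F(s'|s) = F(s') * P_B(s|s').

    Standard GFlowNet identity; the paper's evaluation-balance
    objective is a variant defined over partial episodes."""
    return (log_F_s + log_PF - log_F_sp - log_PB) ** 2

# Toy check: a perfectly balanced transition has zero residual,
# since F(s)*P_F = 2.0 * 0.5 = 1.0 = F(s') * P_B.
r = detailed_balance_residual(
    log_F_s=math.log(2.0), log_PF=math.log(0.5),
    log_F_sp=math.log(1.0), log_PB=math.log(1.0),
)
```

In training, this residual is minimized over sampled transitions; the abstract's point is that the same balance condition can also be read as an evaluator of how far the current policy is from the desired one, rather than only as a loss to drive it there.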