🤖 AI Summary
Existing graph flow models (GFMs) struggle to align with complex human preferences or task-specific objectives, which limits their generation quality. This work proposes Graph-GRPO, an online reinforcement learning framework that, for the first time, makes RL optimization of GFMs fully differentiable. By analytically deriving the transition probabilities of GFMs to enable differentiable rollouts, and by introducing a local perturbation-and-regeneration mechanism for nodes and edges to enhance exploration, Graph-GRPO attains 95.0% and 97.5% Valid-Unique-Novelty scores on planar and tree graphs, respectively, using only 50 denoising steps. It also significantly outperforms existing graph-based RL, fragment-based RL, and genetic-algorithm baselines on molecular optimization tasks, establishing new state-of-the-art performance.
📝 Abstract
Graph generation is a fundamental task with broad applications, such as drug discovery. Recently, discrete flow matching-based graph generation, a.k.a. the graph flow model (GFM), has emerged owing to its superior performance and flexible sampling. However, effectively aligning GFMs with complex human preferences or task-specific objectives remains a significant challenge. In this paper, we propose Graph-GRPO, an online reinforcement learning (RL) framework for training GFMs under verifiable rewards. Our method makes two key contributions: (1) we derive an analytical expression for the transition probability of GFMs, replacing Monte Carlo sampling and enabling fully differentiable rollouts for RL training; (2) we propose a refinement strategy that randomly perturbs specific nodes and edges in a graph and then regenerates them, allowing localized exploration and self-improvement of generation quality. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness of Graph-GRPO. With only 50 denoising steps, our method achieves 95.0% and 97.5% Valid-Unique-Novelty scores on the planar and tree datasets, respectively. Moreover, Graph-GRPO achieves state-of-the-art performance on molecular optimization tasks, outperforming graph-based and fragment-based RL methods as well as classic genetic algorithms.
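The abstract does not spell out Graph-GRPO's exact objective, but GRPO-style training generally scores a group of rollouts with a verifiable reward and normalizes each reward against its group's statistics. A minimal sketch of that group-relative advantage computation (the function name and reward values are illustrative, not from the paper):

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style: each sampled graph's
    reward is standardized against the mean and std of its own group,
    so no learned value function (critic) is needed."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var)
    # eps guards against division by zero when all rewards in a group tie
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical verifiable rewards for a group of 4 sampled graphs
# (e.g., validity/novelty scores); advantages sum to ~0 within the group.
adv = grpo_advantages([0.9, 0.5, 0.7, 0.3])
```

With the analytical transition probabilities described above, these per-sample advantages could weight a fully differentiable policy-gradient loss over the denoising rollout, rather than relying on Monte Carlo estimates.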