AI Summary
This work addresses a limitation of existing graph diffusion models: they neglect edge directionality and thus struggle to capture critical semantic structures such as data flow in neural architectures. The authors propose the first reinforcement learning-guided discrete graph diffusion model tailored for directed acyclic graph (DAG) generation, integrating topological node ordering with positional encoding to enable controllable DAG synthesis. Remarkably, the method learns transferable structural priors using only 7% of the search space for pretraining, achieving state-of-the-art performance across all three tasks in NAS-Bench-201. After fine-tuning, the generated architectures attain accuracy within 0.32 percentage points of models trained on the full dataset while surpassing their training upper bound by 7.3 percentage points. Furthermore, an inverse-optimization accuracy of 9.5%, near random chance, validates the effectiveness of the proposed reward mechanism.
Abstract
Reinforcement learning fine-tuning has proven effective for steering generative diffusion models toward desired properties in image and molecular domains. Graph diffusion models have similarly been applied to combinatorial structure generation, including neural architecture search (NAS). However, neural architectures are directed acyclic graphs (DAGs) where edge direction encodes functional semantics such as data flow, information that existing graph diffusion methods, designed for undirected structures, discard. We propose Directed Graph Policy Optimization (DGPO), which extends reinforcement learning fine-tuning of discrete graph diffusion models to DAGs via topological node ordering and positional encoding. Validated on NAS-Bench-101 and NAS-Bench-201, DGPO matches the benchmark optimum on all three NAS-Bench-201 tasks (91.61%, 73.49%, 46.77%). The central finding is that the model learns transferable structural priors: pretrained on only 7% of the search space, it generates near-oracle architectures after fine-tuning, within 0.32 percentage points of the full-data model and extrapolating 7.3 percentage points beyond its training ceiling. Bidirectional control experiments confirm genuine reward-driven steering, with inverse optimization reaching near random-chance accuracy (9.5%). These results demonstrate that reinforcement learning-steered discrete diffusion, once extended to handle directionality, provides a controllable generative framework for directed combinatorial structures.
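The abstract names the mechanism that makes the diffusion model direction-aware: topological node ordering combined with positional encoding. A minimal sketch of that idea, not the paper's implementation, is shown below: nodes of a DAG are sorted so every edge points forward in the order, and each node then receives an encoding of its rank. The sinusoidal scheme and the function name are illustrative assumptions, not taken from the paper.

```python
import math
from graphlib import TopologicalSorter  # stdlib, Python 3.9+


def topo_positional_encoding(edges, num_nodes, dim=8):
    """Return (order, {node: encoding}) for a DAG given as (u, v) edges,
    where u feeds into v. Raises CycleError if the graph is not a DAG.
    Hypothetical helper for illustration; DGPO's actual encoding may differ."""
    # TopologicalSorter expects a mapping from node to its predecessors.
    preds = {v: [] for v in range(num_nodes)}
    for u, v in edges:
        preds[v].append(u)
    order = list(TopologicalSorter(preds).static_order())

    enc = {}
    for pos, node in enumerate(order):
        # Transformer-style sinusoidal encoding of the topological rank
        # (an assumption here, used only to make the example concrete).
        vec = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        enc[node] = vec
    return order, enc


# Example: a 4-node cell where node 0 is the input and node 3 the output.
order, enc = topo_positional_encoding([(0, 1), (0, 2), (1, 3), (2, 3)], 4)
```

Because every edge respects the computed order, direction information survives as a monotone positional signal that an otherwise undirected graph model can condition on.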