🤖 AI Summary
This work addresses the challenge of explicitly guiding crystal structure generation toward target physical properties, such as energy, during inference in continuous-time generative models. To this end, we introduce reinforcement learning into crystal structure prediction for the first time, proposing a policy-gradient framework that operates directly on the velocity field learned by a flow-matching model. By incorporating stochastic perturbations of the generation dynamics, our approach enables property-guided generation and efficient exploration without requiring explicit computation of the score function. The method reinforces an energy-based objective while preserving structural diversity through composition conditioning, and it achieves performance competitive with score-based reinforcement learning approaches. It can further learn a time-dependent velocity-annealing schedule, improving sampling efficiency, and hence generation speed, by an order of magnitude.
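To make the mechanism concrete, here is a minimal PyTorch sketch of the general technique: inject Gaussian noise into the Euler integration of a pretrained velocity field so that every step becomes a Gaussian transition with a closed-form log-density, then apply a REINFORCE-style policy gradient to that log-density. This is an illustration under stated assumptions, not the paper's implementation; `velocity_net`, `sigma`, and `reward_fn` are hypothetical names, and `x0` is assumed to be a batch of prior samples of shape `(batch, features)`.

```python
import torch

def sample_with_logprob(velocity_net, x0, n_steps=100, sigma=0.1):
    """Euler-Maruyama integration of the stochastically perturbed flow.

    Each step is a Gaussian transition N(x + v * dt, sigma^2 * dt), so the
    trajectory has a closed-form log-probability and the sampler acts as a
    stochastic policy; no score function is ever evaluated.
    """
    dt = 1.0 / n_steps
    x = x0
    logp = torch.zeros(x0.shape[0], device=x0.device)
    for k in range(n_steps):
        t = torch.full((x0.shape[0],), k * dt, device=x0.device)
        mean = x + velocity_net(x, t) * dt   # drift from the learned velocity field
        std = sigma * dt ** 0.5              # exploration noise scale
        # sample the next state detached from the graph, so the gradient
        # flows through the transition density, not the sample (REINFORCE)
        x = (mean + std * torch.randn_like(mean)).detach()
        logp = logp + torch.distributions.Normal(mean, std).log_prob(x).flatten(1).sum(-1)
    return x, logp

def reinforce_step(velocity_net, optimizer, x0, reward_fn):
    """One policy-gradient update that increases the expected reward E[R]."""
    x, logp = sample_with_logprob(velocity_net, x0)
    reward = reward_fn(x)                    # e.g., negative predicted energy
    advantage = reward - reward.mean()       # simple baseline for variance reduction
    loss = -(advantage.detach() * logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()
```

The noise scale `sigma` governs the exploration-exploitation trade-off: as `sigma` approaches zero, the sampler recovers the deterministic flow, which is why perturbing the dynamics can leave the pretrained model's baseline behavior intact.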
📝 Abstract
Continuous-time generative models for crystalline materials enable inverse materials design by learning to predict stable crystal structures, but incorporating explicit target properties into the generative process remains challenging. Policy-gradient reinforcement learning (RL) provides a principled mechanism for aligning generative models with downstream objectives but typically requires access to the score, which has prevented its application to flow-based models that learn only velocity fields. We introduce Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL), a policy-gradient RL framework that operates directly on the learned velocity fields and eliminates the need for explicit computation of the score. OMatG-IRL leverages stochastic perturbations of the underlying generation dynamics, preserving the baseline performance of the pretrained generative model while enabling exploration and policy-gradient estimation at inference time. Using OMatG-IRL, we present the first application of RL to crystal structure prediction (CSP). Our method enables effective reinforcement of an energy-based objective while preserving diversity through composition conditioning, and it achieves performance competitive with score-based RL approaches. Finally, we show that OMatG-IRL can learn time-dependent velocity-annealing schedules, enabling accurate CSP with order-of-magnitude improvements in sampling efficiency and a corresponding reduction in generation time.
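The velocity-annealing result admits a similarly simple parameterization; the sketch below is one plausible reading under the same assumptions (the class name, network size, and initialization are illustrative, not taken from the paper): a small MLP maps integration time t to a positive factor alpha(t) that rescales the frozen velocity field, and only the schedule's few parameters are trained with the policy gradient, allowing the sampler to take far fewer integration steps.

```python
import torch
import torch.nn as nn

class AnnealingSchedule(nn.Module):
    """Learnable time-dependent scaling alpha(t) of a frozen velocity field.

    Only these few parameters are trained by the policy gradient; the
    pretrained flow-matching model itself is left untouched.
    """

    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, 1))
        # zero-initialize the head so alpha(t) = 1 at the start of training,
        # i.e., the sampler begins as the unmodified pretrained dynamics
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, t):
        return torch.exp(self.net(t.unsqueeze(-1))).squeeze(-1)  # positive by construction

def fast_sample(velocity_net, schedule, x0, n_steps=10, sigma=0.05):
    """Coarse integration with the annealed velocity: far fewer steps."""
    dt = 1.0 / n_steps
    x = x0
    for k in range(n_steps):
        t = torch.full((x0.shape[0],), k * dt, device=x0.device)
        alpha = schedule(t).view(-1, *([1] * (x0.dim() - 1)))  # broadcast over features
        x = x + alpha * velocity_net(x, t) * dt + sigma * dt ** 0.5 * torch.randn_like(x)
    return x
```

Training would follow the same REINFORCE recipe as in the earlier sketch, with `velocity_net` frozen and only `schedule.parameters()` passed to the optimizer; `fast_sample` would then need the same per-step log-probability bookkeeping as `sample_with_logprob`.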