🤖 AI Summary
This work addresses the lack of structural priors in neural network weight generation and initialization. We propose Gradient Flow Matching (GFM), a framework that extends diffusion- and flow-based generative modeling to weight-space learning by explicitly treating optimization trajectories as an inductive bias in the generative process. Key components include adjoint-matching reward fine-tuning, conditioning on task-specific context, and an informative source-distribution design. The conditional generation architecture combines autoencoder latent representations of weights with a Kaiming-uniform source distribution. Experiments show that GFM matches or surpasses existing baselines in weight-distribution fidelity, downstream initialization quality, and post-initialization fine-tuning. GFM also enables effective detection of harmful covariate shifts in safety-critical systems, improving deployment robustness.
📝 Abstract
Diffusion- and flow-based generative models have achieved remarkable success in domains such as image synthesis, video generation, and natural language modeling. In this work, we extend these advances to weight-space learning, incorporating structural priors derived from optimization dynamics. Central to our approach is casting the trajectory induced by gradient descent as a trajectory inference problem. We unify several trajectory inference techniques under gradient flow matching, providing a principled basis for treating optimization paths as an inductive bias. We further explore architectural and algorithmic choices, including reward fine-tuning via adjoint matching, autoencoders for latent weight representations, conditioning on task-specific context data, and informative source distributions such as Kaiming uniform. Experiments show that our method matches or surpasses baselines in generating in-distribution weights, improves initialization for downstream training, and supports fine-tuning for further gains. Finally, we demonstrate a practical application in safety-critical systems, detecting harmful covariate shifts, where our method outperforms the closest comparable baseline.
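The abstract's two core ingredients (a conditional-flow-matching objective and a Kaiming-uniform source distribution) can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the "trained" target weights, the sample sizes, and the constant-velocity stand-in for the learned vector field `v_theta(x_t, t)` are all hypothetical placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_uniform(shape, fan_in, rng):
    # Kaiming uniform draws from U(-b, b) with b = sqrt(6 / fan_in)
    # (the ReLU gain factor is omitted for brevity).
    bound = np.sqrt(6.0 / fan_in)
    return rng.uniform(-bound, bound, size=shape)

d = 16  # flattened weight-vector dimension (illustrative)
n = 512

# Source samples: Kaiming-uniform initializations.
x0 = kaiming_uniform((n, d), fan_in=d, rng=rng)
# Target samples: stand-in for weights produced by gradient descent.
x1 = rng.normal(0.3, 0.05, size=(n, d))

# Conditional flow matching: interpolate x_t = (1 - t) x0 + t x1;
# the regression target is the constant pairwise velocity u = x1 - x0.
t = rng.uniform(0.0, 1.0, size=(n, 1))
xt = (1.0 - t) * x0 + t * x1
u = x1 - x0

# Trivial "model": the mean velocity, standing in for a neural net v_theta(x_t, t).
v_pred = u.mean(axis=0, keepdims=True)
fm_loss = float(np.mean((v_pred - u) ** 2))

# Sampling: Euler-integrate dx/dt = v(x, t) from a fresh initialization,
# 10 steps of size 0.1 covering t in [0, 1].
x = kaiming_uniform((1, d), fan_in=d, rng=rng)
for _ in range(10):
    x = x + 0.1 * v_pred  # velocity field is constant in this sketch
```

In the full method, `v_pred` would be a network conditioned on task context (and operating in an autoencoder latent space), but the loss and integration loop keep this shape.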