🤖 AI Summary
Gaussian-based probabilistic generative models (GPGMs) for molecular generation suffer from excessive diffusion steps, high computational overhead, and low training and sampling efficiency. Method: This paper proposes an identity-aware Gaussian approximation framework. We first introduce and analyze the "data identity vanishing" property, theoretically derive the Gaussianization critical step, and then replace the redundant tail of the diffusion trajectory with an exact closed-form Gaussian distribution, preserving the full resolution of learning dynamics while eliminating repeated stochastic perturbations. Contribution/Results: The method substantially reduces the number of sampling steps without compromising training granularity or inference fidelity. Experiments demonstrate simultaneous improvements in generation quality and computational efficiency across multimodal molecular generation tasks, establishing a practical and efficient paradigm for deploying GPGMs.
📝 Abstract
Gaussian-based Probabilistic Generative Models (GPGMs) generate data by reversing a stochastic process that progressively corrupts samples with Gaussian noise. While these models have achieved state-of-the-art performance across diverse domains, their practical deployment remains constrained by the high computational cost of long generative trajectories, which often involve hundreds to thousands of steps during training and sampling. In this work, we introduce a theoretically grounded and empirically validated framework that improves generation efficiency without sacrificing training granularity or inference fidelity. Our key insight is that for certain data modalities, the noising process causes data to rapidly lose its identity and converge toward a Gaussian distribution. We analytically identify a characteristic step at which the data has acquired sufficient Gaussianity, and then replace the remaining generation trajectory with a closed-form Gaussian approximation. Unlike existing acceleration techniques that coarsen the trajectory by skipping steps, our method preserves the full resolution of learning dynamics while avoiding redundant stochastic perturbations between "Gaussian-like" distributions. Empirical results across multiple data modalities demonstrate substantial improvements in both sample quality and computational efficiency.
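The idea of locating a step at which the forward marginal is already "Gaussian enough" can be illustrated with a minimal sketch. This is not the paper's code: it assumes a standard DDPM linear noise schedule and a toy 1-D Gaussian data distribution (for which the forward marginal is exactly Gaussian, so its KL divergence to N(0, 1) has a closed form), and the tolerance `eps` is an illustrative choice.

```python
import numpy as np

# Toy sketch (hypothetical names, not the paper's implementation):
# find the first diffusion step whose forward marginal is within a
# KL tolerance of the standard normal prior.

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # standard DDPM linear schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)     # \bar{alpha}_t

mu0, sigma0 = 3.0, 0.5                  # toy data distribution: x_0 ~ N(mu0, sigma0^2)

def kl_to_std_normal(m, var):
    """Closed-form KL( N(m, var) || N(0, 1) )."""
    return 0.5 * (var + m**2 - 1.0 - np.log(var))

# Forward marginal at step t: x_t ~ N( sqrt(abar_t)*mu0, abar_t*sigma0^2 + 1 - abar_t )
means = np.sqrt(alpha_bar) * mu0
vars_ = alpha_bar * sigma0**2 + (1.0 - alpha_bar)
kls = kl_to_std_normal(means, vars_)

eps = 1e-3                              # Gaussianity tolerance (illustrative)
t_star = int(np.argmax(kls < eps))      # first step with KL below tolerance

print("Gaussianization critical step:", t_star)
# Steps t_star..T-1 add no usable information about the data: sampling can
# start directly from x_{t_star} ~ N(0, 1) instead of simulating them.
```

In this toy setting the KL decays monotonically, so `t_star` is well defined; for real data one would replace the closed-form KL with an estimated Gaussianity measure on the empirical marginals.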