🤖 AI Summary
This work addresses the challenge of training neural samplers to achieve full coverage of multimodal distributions without access to target samples. We propose the first principled framework optimizing the forward KL divergence. Its core innovation is importance-weighted score matching (IW-SM), which—combined with Monte Carlo importance sampling—enables unbiased gradient estimation and guaranteed mode coverage using only unnormalized density evaluations. Theoretical analysis characterizes the bias–variance trade-off in the estimator. Experiments on a 120-mode Gaussian mixture and a symmetric particle system demonstrate consistent superiority over state-of-the-art methods across all metrics: Wasserstein distance, maximum mean discrepancy (MMD), and coverage. Our approach effectively mitigates mode collapse—a well-known limitation of inverse KL-based methods—while requiring no ground-truth samples.
📝 Abstract
Training neural samplers directly from unnormalized densities without access to target distribution samples presents a significant challenge. A critical desideratum in these settings is achieving comprehensive mode coverage, ensuring the sampler captures the full diversity of the target distribution. However, prevailing methods often circumvent the lack of target data by optimizing reverse KL-based objectives. Such objectives inherently exhibit mode-seeking behavior, potentially leading to incomplete representation of the underlying distribution. While alternative approaches strive for better mode coverage, they typically rely on implicit mechanisms like heuristics or iterative refinement. In this work, we propose a principled approach for training diffusion-based samplers by directly targeting an objective analogous to the forward KL divergence, which is conceptually known to encourage mode coverage. We introduce extit{Importance Weighted Score Matching}, a method that optimizes this desired mode-covering objective by re-weighting the score matching loss using tractable importance sampling estimates, thereby overcoming the absence of target distribution data. We also provide theoretical analysis of the bias and variance for our proposed Monte Carlo estimator and the practical loss function used in our method. Experiments on increasingly complex multi-modal distributions, including 2D Gaussian Mixture Models with up to 120 modes and challenging particle systems with inherent symmetries -- demonstrate that our approach consistently outperforms existing neural samplers across all distributional distance metrics, achieving state-of-the-art results on all benchmarks.