🤖 AI Summary
This work addresses efficient sampling from unnormalized densities (e.g., molecular Boltzmann distributions). Methodologically, it proposes an off-policy reinforcement learning framework that integrates amortized inference with sequential Monte Carlo (SMC): a neural sampling policy and value function are trained jointly under a maximum-entropy RL objective, with SMC-generated samples serving as the behavior policy for policy updates. It further introduces an annealed importance-weighted experience replay buffer and an adaptive weight tempering scheme that improve exploration and training stability. The key contribution is the synergistic co-design of learned proposals, twist functions, importance-weighted replay, and SMC's intrinsic structure, which improves approximation accuracy and sampling efficiency on multimodal continuous and discrete distributions, exemplified by the conformational space of alanine dipeptide. Experiments demonstrate superior distribution-fitting quality and training robustness compared to amortized variational inference and standard MCMC baselines.
📝 Abstract
This paper proposes a synergy of amortised and particle-based methods for sampling from distributions defined by unnormalised density functions. We state a connection between sequential Monte Carlo (SMC) and neural sequential samplers trained by maximum-entropy reinforcement learning (MaxEnt RL), wherein learnt sampling policies and value functions define proposal kernels and twist functions. Exploiting this connection, we introduce an off-policy RL training procedure for the sampler that uses samples from SMC -- using the learnt sampler as a proposal -- as a behaviour policy that better explores the target distribution. We describe techniques for stable joint training of proposals and twist functions and an adaptive weight tempering scheme to reduce training signal variance. Furthermore, building upon past attempts to use experience replay to guide the training of neural samplers, we derive a way to combine historical samples with annealed importance sampling weights within a replay buffer. On synthetic multi-modal targets (in both continuous and discrete spaces) and the Boltzmann distribution of alanine dipeptide conformations, we demonstrate improvements in approximating the true distribution as well as training stability compared to both amortised and Monte Carlo methods.
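To make the SMC-as-behaviour-policy idea concrete, here is a minimal toy sketch of one importance-sampling-with-resampling step: particles from a fixed Gaussian proposal (standing in for the learnt sampling policy) are reweighted against an unnormalised bimodal target and resampled, so the surviving particles can be replayed for off-policy updates. The target, proposal, and tempering exponent `beta` are illustrative assumptions, not the paper's exact construction.

```python
import math
import random

def log_target(x):
    # Unnormalised bimodal toy target: mixture of peaks near -2 and +2 (hypothetical)
    return math.log(math.exp(-(x + 2) ** 2) + math.exp(-(x - 2) ** 2))

def log_proposal(x):
    # Log-density of the proposal, a standard normal N(0, 1)
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)

def smc_step(n_particles=1000, beta=0.5):
    # Draw particles from the proposal (stand-in for the learnt sampler)
    particles = [random.gauss(0.0, 1.0) for _ in range(n_particles)]
    # Tempered importance weights (p/q)^beta, echoing the weight tempering idea
    log_w = [beta * (log_target(x) - log_proposal(x)) for x in particles]
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    norm_w = [wi / z for wi in w]
    # Multinomial resampling: resampled particles serve as behaviour-policy samples
    resampled = random.choices(particles, weights=norm_w, k=n_particles)
    ess = 1.0 / sum(wi * wi for wi in norm_w)  # effective sample size
    return resampled, norm_w, ess
```

The effective sample size returned here is the usual diagnostic for proposal quality; as the learnt proposal approaches the target, the weights flatten and the ESS approaches the particle count.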
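The abstract's replay-buffer idea, combining historical samples with annealed importance weights, can likewise be sketched as a buffer that stores samples with log-weights and draws minibatches in proportion to tempered, self-normalised weights. The class name and the annealing exponent `beta` are hypothetical, a minimal illustration rather than the paper's scheme.

```python
import math
import random

class ImportanceWeightedReplay:
    """Toy replay buffer: historical samples replayed proportionally to
    annealed importance weights (a sketch, not the paper's exact mechanism)."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.samples = []
        self.log_weights = []

    def add(self, x, log_w):
        # Evict the oldest entry once the buffer is full
        if len(self.samples) >= self.capacity:
            self.samples.pop(0)
            self.log_weights.pop(0)
        self.samples.append(x)
        self.log_weights.append(log_w)

    def draw(self, k, beta=1.0):
        # Self-normalised annealed weights: softmax(beta * log_w),
        # shifted by the max log-weight for numerical stability
        m = max(self.log_weights)
        w = [math.exp(beta * (lw - m)) for lw in self.log_weights]
        return random.choices(self.samples, weights=w, k=k)
```

Lowering `beta` flattens the replay distribution toward uniform over the buffer, trading fidelity to the target for lower variance in the training signal, which is the trade-off the adaptive tempering scheme navigates.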