🤖 AI Summary
This work addresses the distributional mismatch between simulation and experiment in scientific and engineering domains, where simulations—though grounded in physical laws—are subject to approximation errors, while experimental observations, though real, only partially capture the system state. To bridge this gap, the authors propose a domain-agnostic Adversarial Distribution Alignment (ADA) framework: a generative model is first pretrained on complete but biased simulation data and then aligned with partial yet authentic experimental observables via adversarial training. Theoretically, the method recovers the target distribution under multidimensional, correlated observations, and it achieves, for the first time, effective alignment between atomistic generative models and real experimental data. Validation on synthetic benchmarks, molecular systems, and protein experiments demonstrates that ADA substantially narrows the simulation-to-experiment gap and successfully accomplishes cross-domain distribution alignment.
📝 Abstract
A fundamental challenge in science and engineering is the simulation-to-experiment gap. While we often possess prior knowledge of the governing physical laws, those laws can be too difficult to solve exactly for complex systems. Such systems are commonly modeled using simulators, which impose computational approximations. Meanwhile, experimental measurements more faithfully represent the real world, but experimental data typically consists of observations that only partially reflect the system's full underlying state. We propose a data-driven distribution alignment framework that bridges this simulation-to-experiment gap by pre-training a generative model on fully observed (but imperfect) simulation data, then aligning it with partial (but real) observations of experimental data. While our method is domain-agnostic, we ground our approach in the physical sciences by introducing Adversarial Distribution Alignment (ADA). This method aligns a generative model of atomic positions -- initially trained on a simulated Boltzmann distribution -- with the distribution of experimental observations. We prove that our method recovers the target observable distribution, even with multiple, potentially correlated observables. We also empirically validate our framework on synthetic, molecular, and experimental protein data, demonstrating that it can align generative models with diverse observables. Our code is available at https://kaityrusnelson.com/ada/.
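The core idea described above -- pretrain a generator on (biased) simulation data, then adversarially align its *observables* with real experimental measurements -- can be illustrated with a minimal 1D toy sketch. Everything below is an illustrative assumption, not the paper's actual ADA implementation: the simulator's distribution, the experimental observable distribution, the affine generator, and the logistic discriminator are all stand-ins chosen so the gradients can be written by hand.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))


# --- Illustrative setup (all distributions are assumptions) ---
# "Simulation" pretraining leaves the generator at x = mu + z, z ~ N(0, 1),
# with mu = 0 (the simulator's biased estimate). The experiment observes the
# same quantity, but its true distribution is N(1, 1).
mu = 0.0                 # generator parameter, as pretrained on simulation
w, c = 0.0, 0.0          # logistic discriminator D(o) = sigmoid(w*o + c)
lr, batch, steps = 0.1, 512, 1000

for _ in range(steps):
    # Real experimental observables vs. generated ("fake") observables.
    o_real = rng.normal(1.0, 1.0, batch)
    o_fake = mu + rng.normal(0.0, 1.0, batch)

    # Discriminator step: gradient descent on -log D(real) - log(1 - D(fake)).
    d_real = sigmoid(w * o_real + c)
    d_fake = sigmoid(w * o_fake + c)
    grad_w = np.mean(d_fake * o_fake) - np.mean((1 - d_real) * o_real)
    grad_c = np.mean(d_fake) - np.mean(1 - d_real)
    w -= lr * grad_w
    c -= lr * grad_c

    # Generator step (non-saturating loss): gradient descent on -log D(fake).
    d_fake = sigmoid(w * o_fake + c)
    grad_mu = -np.mean((1 - d_fake) * w)
    mu -= lr * grad_mu

# Adversarial alignment pulls mu from the simulator's biased value (0)
# toward the experimental mean (1).
print(round(mu, 2))
```

The sketch only moves a single generator parameter, whereas the paper's setting involves a generative model over full atomic configurations aligned through partial, possibly correlated observables; the point here is just the training loop structure (alternating discriminator and generator updates on the observable distribution).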