🤖 AI Summary
This study addresses the scalability limitations of traditional maximum entropy-based population synthesis, whose exact expectation step requires enumerating an exponentially large tuple space and becomes infeasible beyond roughly 20 categorical attributes. To overcome this, the authors propose GibbsPCDSolver, which introduces persistent contrastive divergence (PCD) into this domain for the first time. By combining Gibbs sampling with stochastic gradient optimization, the method continuously updates a persistent pool of synthetic individuals to approximate model expectations without ever enumerating the full tuple space. This breaks the dimensionality barrier, enabling efficient modeling of population distributions with up to 50 attributes. In experiments spanning K = 12–50, the mean relative error (MRE) remains stable within [0.010, 0.018]; on the Syn-ISTAT benchmark, the method achieves an MRE of 0.03, yields an effective sample size equal to the true sample size N, and improves diversity by 86.8× over generalized raking.
📝 Abstract
Maximum entropy (MaxEnt) modelling provides a principled framework for generating synthetic populations from aggregate census data, without access to individual-level microdata. The bottleneck of existing approaches is exact expectation computation, which requires summing over the full tuple space $\cX$ and becomes infeasible for more than $K \approx 20$ categorical attributes. We propose \emph{GibbsPCDSolver}, a stochastic replacement for this computation based on Persistent Contrastive Divergence (PCD): a persistent pool of $N$ synthetic individuals is updated by Gibbs sweeps at each gradient step, providing a stochastic approximation of the model expectations without ever materialising $\cX$. We validate the approach on controlled benchmarks and on \emph{Syn-ISTAT}, a $K{=}15$ Italian demographic benchmark with analytically exact marginal targets derived from ISTAT-inspired conditional probability tables. Scaling experiments across $K \in \{12, 20, 30, 40, 50\}$ confirm that GibbsPCDSolver maintains $\MRE \in [0.010, 0.018]$ while $|\cX|$ grows eighteen orders of magnitude, with runtime scaling as $O(K)$ rather than $O(|\cX|)$. On Syn-ISTAT, GibbsPCDSolver reaches $\MRE{=}0.03$ on training constraints and -- crucially -- produces populations with effective sample size $\Neff = N$ versus $\Neff \approx 0.012\,N$ for generalised raking, an $86.8{\times}$ diversity advantage that is essential for agent-based urban simulations.
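The core loop described in the abstract — a persistent pool of synthetic individuals updated by Gibbs sweeps, feeding a stochastic gradient step on the MaxEnt potentials — can be sketched compactly. The toy below is an illustrative reconstruction under stated assumptions, not the paper's implementation: it assumes a chain-structured MaxEnt model with unary and consecutive-pair marginal targets, and the function name `gibbs_pcd_fit`, the target format, and all hyperparameters are invented for illustration. Note that the pool's marginals stand in for the model expectations, so no term ever sums over the full tuple space.

```python
import numpy as np

def gibbs_pcd_fit(uni_targets, pair_targets, N=1000, steps=200, lr=0.3, seed=0):
    """Illustrative PCD with Gibbs sweeps for a chain-structured MaxEnt model.

    uni_targets:  list of K arrays, uni_targets[k][c] = target P(x_k = c).
    pair_targets: list of K-1 matrices, pair_targets[k][a, b] = target
                  P(x_k = a, x_{k+1} = b) for consecutive attribute pairs.
    Returns (pool, theta, phi): the persistent pool of N synthetic
    individuals (N x K integer codes) and the fitted log-potentials.
    """
    rng = np.random.default_rng(seed)
    K = len(uni_targets)
    sizes = [len(t) for t in uni_targets]
    theta = [np.zeros(s) for s in sizes]                              # unary
    phi = [np.zeros((sizes[k], sizes[k + 1])) for k in range(K - 1)]  # pairwise
    # Persistent pool: initialised once, never reset between gradient steps.
    pool = np.stack([rng.integers(0, s, size=N) for s in sizes], axis=1)

    for _ in range(steps):
        # One Gibbs sweep: resample each attribute of every individual from
        # its conditional given the current values of its chain neighbours.
        for k in range(K):
            logits = np.tile(theta[k], (N, 1))            # shape (N, sizes[k])
            if k > 0:
                logits += phi[k - 1][pool[:, k - 1], :]
            if k < K - 1:
                logits += phi[k][:, pool[:, k + 1]].T
            logits -= logits.max(axis=1, keepdims=True)
            p = np.exp(logits)
            p /= p.sum(axis=1, keepdims=True)
            # Vectorised categorical sampling via inverse CDF.
            u = rng.random(N)
            idx = (p.cumsum(axis=1) < u[:, None]).sum(axis=1)
            pool[:, k] = np.minimum(idx, sizes[k] - 1)
        # Stochastic gradient ascent on the MaxEnt objective:
        # (target expectation) - (pool expectation), no enumeration of |X|.
        for k in range(K):
            emp = np.bincount(pool[:, k], minlength=sizes[k]) / N
            theta[k] += lr * (uni_targets[k] - emp)
        for k in range(K - 1):
            emp2 = np.zeros_like(phi[k])
            np.add.at(emp2, (pool[:, k], pool[:, k + 1]), 1.0 / N)
            phi[k] += lr * (pair_targets[k] - emp2)
    return pool, theta, phi
```

Per gradient step, the Gibbs sweep costs O(N·K) rather than O(|X|), which is the scaling behaviour the abstract reports; the pool itself doubles as the synthetic population once training converges.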