AI Summary
Existing synthetic population generation methods struggle to simultaneously ensure statistical consistency and enable controllable generation with behavioral semantics. To address this challenge, this work proposes SemaPop, which, for the first time, integrates high-level persona representations extracted by large language models as semantic conditioning into a WGAN-GP-based generative framework, augmented with marginal regularization constraints. By jointly modeling abstract behavioral patterns and multidimensional statistical structures, SemaPop achieves semantically guided yet statistically consistent population synthesis. The approach significantly improves the fidelity of generated populations to real-world data in both marginal and joint distributions, while preserving sample diversity and feasibility. Consequently, it enhances the controllability and interpretability of synthetic populations without compromising their realism or structural integrity.
Abstract
Population synthesis is a critical component of individual-level socio-economic simulation, yet remains challenging due to the need to jointly represent statistical structure and latent behavioral semantics. Existing population synthesis approaches predominantly rely on structured attributes and statistical constraints, leaving a gap in semantic-conditioned population generation that can capture abstract behavioral patterns implicit in survey data. This study proposes SemaPop, a semantic-statistical population synthesis model that integrates large language models (LLMs) with generative population modeling. SemaPop derives high-level persona representations from individual survey records and incorporates them as semantic conditioning signals for population generation, while marginal regularization is introduced to enforce alignment with target population marginals. In this study, the framework is instantiated using a Wasserstein GAN with gradient penalty (WGAN-GP) backbone, referred to as SemaPop-GAN. Extensive experiments demonstrate that SemaPop-GAN achieves improved generative performance, yielding closer alignment with target marginal and joint distributions while maintaining sample-level feasibility and diversity under semantic conditioning. Ablation studies further confirm the contribution of semantic persona conditioning and architectural design choices to balancing marginal consistency and structural realism. These results demonstrate that SemaPop-GAN enables controllable and interpretable population synthesis through effective semantic-statistical information fusion. SemaPop-GAN also provides a promising modular foundation for developing generative population projection systems that integrate individual-level behavioral semantics with population-level statistical constraints.
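The abstract does not give the exact form of the marginal regularization, so the following is only a minimal sketch of how such a term might be combined with a WGAN-style generator loss. It assumes, purely for illustration, that each generated record holds per-category probabilities for every attribute, that the regularizer is an L1 distance between batch-averaged and target marginals, and that the critic scores are already available. The function names, the `lam` weight, and the toy attributes are hypothetical; the persona conditioning and the gradient penalty on the critic are omitted.

```python
from typing import Dict, List, Sequence

Record = Dict[str, List[float]]  # attribute name -> per-category probabilities


def batch_marginals(samples: Sequence[Record]) -> Dict[str, List[float]]:
    """Average the per-category probabilities of each attribute over a batch."""
    n = len(samples)
    return {
        attr: [sum(s[attr][k] for s in samples) / n for k in range(len(probs))]
        for attr, probs in samples[0].items()
    }


def marginal_penalty(samples: Sequence[Record],
                     targets: Dict[str, List[float]]) -> float:
    """Assumed regularizer: summed L1 gap between generated and target marginals."""
    gen = batch_marginals(samples)
    return sum(
        sum(abs(g - t) for g, t in zip(gen[attr], target))
        for attr, target in targets.items()
    )


def generator_loss(critic_scores: Sequence[float],
                   samples: Sequence[Record],
                   targets: Dict[str, List[float]],
                   lam: float = 1.0) -> float:
    """WGAN generator term (negative mean critic score) plus weighted penalty."""
    adversarial = -sum(critic_scores) / len(critic_scores)
    return adversarial + lam * marginal_penalty(samples, targets)


# Toy batch of two generated records with two hypothetical attributes.
samples = [
    {"age": [1.0, 0.0], "mode": [0.0, 1.0, 0.0]},
    {"age": [0.0, 1.0], "mode": [0.0, 1.0, 0.0]},
]
targets = {"age": [0.5, 0.5], "mode": [0.25, 0.5, 0.25]}

penalty = marginal_penalty(samples, targets)
loss = generator_loss([0.5, 1.5], samples, targets, lam=2.0)
```

In this toy batch the `age` marginal already matches its target, so the whole penalty comes from `mode`, where every generated record picks the same category. A larger `lam` pushes the generator harder toward the target marginals, at the possible cost of the joint structure learned through the adversarial term.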