🤖 AI Summary
This study addresses the limited utility of traditional differential privacy (DP) methods on high-dimensional user profile data and asks whether large language models (LLMs) can faithfully reproduce the original data distribution when seeded with DP-protected inputs. It evaluates PersonaLedger, an LLM-based agentic financial simulator, as a generator of synthetic data from DP-sanitized persona seeds. The evaluation covers temporal behavior, distribution alignment, and a downstream fraud detection task (AUC = 0.70 at ε = 1). The findings show that priors learned by the LLM override the input statistics, causing significant distribution drift in both demographic and temporal features and degrading statistical fidelity. This work offers new insights into the privacy–utility trade-off in synthetic data generation with foundation models.
📝 Abstract
LLM-based simulators offer a promising path for generating complex synthetic data where traditional differentially private (DP) methods struggle with high-dimensional user profiles. But can LLMs faithfully reproduce statistical distributions from DP-protected inputs? We evaluate this using PersonaLedger, an agentic financial simulator, seeded with DP synthetic personas derived from real user statistics. We find that PersonaLedger achieves promising fraud detection utility (AUC 0.70 at ε = 1) but exhibits significant distribution drift due to systematic LLM biases: learned priors override input statistics for temporal and demographic features. These failure modes must be addressed before LLM-based methods can handle the richer user representations where they might otherwise excel.
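To make "distribution drift" concrete, the following is a minimal sketch of how a DP-protected marginal could be compared with a simulator's output. It does not reproduce the paper's pipeline: the abstract does not state which DP mechanism or drift metric was used, so the Laplace mechanism, total variation distance, the function names, and all numbers below are illustrative assumptions.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_marginal(counts, epsilon, rng):
    """Noisy, normalized histogram via the Laplace mechanism.

    Assumes L1 sensitivity 1 (adding or removing one user changes one
    bucket count by 1). Clipping and normalizing are post-processing.
    """
    noisy = {
        k: max(0.0, c + laplace_noise(1.0 / epsilon, rng))
        for k, c in counts.items()
    }
    total = sum(noisy.values())
    if total == 0.0:
        # Degenerate case: every bucket was clipped to zero.
        return {k: 1.0 / len(noisy) for k in noisy}
    return {k: v / total for k, v in noisy.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


rng = random.Random(0)

# Hypothetical age-bucket counts standing in for real user statistics.
real_counts = {"18-29": 300, "30-44": 400, "45-64": 250, "65+": 50}
seed_marginal = dp_marginal(real_counts, epsilon=1.0, rng=rng)

# Hypothetical marginal observed in the simulator's output, skewed
# toward younger personas to mimic an LLM prior overriding the seed.
simulated = {"18-29": 0.45, "30-44": 0.40, "45-64": 0.13, "65+": 0.02}

drift = total_variation(seed_marginal, simulated)
print(f"TV distance between DP seed and simulated marginal: {drift:.3f}")
```

A distance near 0 would mean the simulator preserved the seeded marginal, while larger values point to the kind of prior-driven shift the abstract describes. The same comparison could be repeated per feature (for example, transaction hour for temporal drift).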