AI Summary
This work addresses the scarcity of digital footprint data, which hinders behavioral modeling and model generalization. To overcome this limitation, the authors propose a large language model (LLM) agent framework grounded in structured user personas, enabling the first systematic generation of high-fidelity, diverse user event sequences along with their corresponding digital artifacts, such as emails, messages, and calendar entries. The approach substantially enhances both the realism and diversity of synthetic data compared to existing baselines. Furthermore, models fine-tuned on the generated data demonstrate markedly improved out-of-distribution generalization on real-world tasks, underscoring the utility of the proposed framework for advancing behavioral AI research.
Abstract
Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, and reminders. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.