Scaling Synthetic Data Creation with 1,000,000,000 Personas

📅 2024-06-28
🏛️ arXiv.org
📈 Citations: 89
✨ Influential: 11
🤖 AI Summary
Synthetic data generation is bottlenecked by limited diversity and heavy reliance on manual curation. Method: This paper proposes Persona Hub, a scalable, automated approach built on a billion-scale repository of diverse personas curated from web data. By prompting an LLM from many persona perspectives, it generates synthetic data for several scenarios: mathematical and logical reasoning problems, user instructions, knowledge-rich texts, game NPCs, and tools (functions). Contribution/Results: Persona Hub offers a scalable, reusable, persona-grounded framework for synthetic data generation. The authors present it as versatile, flexible, and easy to use, and suggest it could move synthetic data creation from labor-intensive curation toward large-scale, persona-aware automation.

📝 Abstract
We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.
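The core idea in the abstract, pairing a data-synthesis request with a persona so that the LLM answers from that perspective, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the personas, the task templates, and the helper names `build_prompt` and `build_all_prompts` are all hypothetical, and no LLM is actually called.

```python
from itertools import product

# Hypothetical personas for illustration; these are not entries from Persona Hub.
PERSONAS = [
    "a chemical kinetics researcher",
    "a high-school basketball coach",
    "a freight-train dispatcher",
]

# Hypothetical task templates, loosely modeled on use cases named in the abstract.
TASKS = {
    "math": "Create a challenging math problem that this persona might encounter.",
    "instruction": "Write a request this persona might send to an AI assistant.",
    "knowledge": "Write a short, knowledge-rich article this persona might author.",
}


def build_prompt(persona: str, task: str) -> str:
    """Attach a persona to a task template to form one synthesis prompt."""
    return f"{TASKS[task]}\n\nPersona: {persona}"


def build_all_prompts(personas, tasks):
    """Cross every persona with every task; each pair yields a distinct prompt."""
    return [build_prompt(p, t) for p, t in product(personas, tasks)]


prompts = build_all_prompts(PERSONAS, TASKS)
print(len(prompts))
print(prompts[0])
```

Each resulting string would then be sent to an LLM of the reader's choice. The point of the sketch is only that the number of distinct prompts grows with the number of personas, which is why a large persona collection can drive diversity.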
Problem

Research questions and friction points this paper is trying to address.

Scaling synthetic data creation using one billion diverse personas
Generating diverse synthetic data for varied scenarios with an LLM
Supporting LLM research and development with versatile persona-driven data synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Persona-driven data synthesis using one billion diverse personas
Persona Hub, curated automatically from web data for scalability
Versatile applications, including reasoning problems, instructions, knowledge-rich texts, game NPCs, and tools
Xin Chan
Tencent AI Lab Seattle
Xiaoyang Wang
Tencent AI Lab Seattle
Dian Yu
Tencent AI Lab Seattle
Haitao Mi
Principal Researcher, Tencent US
Dong Yu
Tencent AI Lab Seattle