Measuring diversity of synthetic prompts and data generated with fine-grained persona prompting

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how fine-grained persona prompting affects the diversity of synthetic instructions and responses generated by large language models (LLMs). Method: the authors propose a systematic, multi-dimensional evaluation framework that combines lexical diversity metrics (TTR, MTLD) with redundancy analysis, and run controlled sampling experiments across model scales and levels of persona granularity. Contributions/Results: (1) synthetic instructions exhibit significantly lower lexical diversity than human-authored data; (2) persona prompting consistently improves output diversity, particularly with larger models; (3) further refining persona descriptions (e.g., adding background or personality traits) yields no statistically significant diversity gain. These findings point to diminishing returns between persona-specification granularity and response diversity in persona-driven data synthesis, and provide empirical evidence and methodological guidance for efficiently constructing high-quality synthetic instruction datasets.
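The two lexical metrics named above are standard and easy to sketch. Below is a minimal, self-contained Python illustration of TTR and MTLD (McCarthy & Jarvis, 2010), assuming simple whitespace tokenization; the example corpus and function names are illustrative, not taken from the paper.

```python
# Minimal sketch of the two lexical diversity metrics named above,
# assuming whitespace tokenization. Not the authors' implementation.

def ttr(tokens):
    """Type-token ratio: unique tokens / total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mtld(tokens, threshold=0.72):
    """Measure of Textual Lexical Diversity (McCarthy & Jarvis, 2010).

    Counts how many 'factors' (segments whose running TTR stays above
    the threshold) the text splits into, then divides the token count
    by the factor count. Averaged over forward and backward passes.
    """
    def one_pass(seq):
        factors, types, count = 0.0, set(), 0
        for tok in seq:
            types.add(tok)
            count += 1
            if len(types) / count <= threshold:
                factors += 1          # segment's TTR fell to threshold: close it
                types, count = set(), 0
        if count:                      # partial credit for the leftover segment
            factors += (1 - len(types) / count) / (1 - threshold)
        return len(seq) / factors if factors else 0.0

    return (one_pass(tokens) + one_pass(tokens[::-1])) / 2

# Hypothetical two-prompt corpus, purely for illustration.
docs = [
    "write a short story about a lighthouse keeper",
    "write a short story about a deep sea diver",
]
tokens = " ".join(docs).lower().split()
print(f"TTR: {ttr(tokens):.3f}  MTLD: {mtld(tokens):.1f}")
```

Unlike TTR, MTLD is largely insensitive to text length, which is why the two are often reported together when comparing corpora of different sizes.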

📝 Abstract
Fine-grained personas have recently been used for generating 'diverse' synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. Firstly, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contributes to generated text diversity. We find that while persona-prompting does improve lexical diversity (especially with larger models), fine-grained detail in personas doesn't increase diversity noticeably.
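The abstract pairs lexical diversity with redundancy metrics, but this page does not spell out which redundancy measures were used. As a hedged stand-in, the sketch below computes one common cross-sample redundancy signal: the fraction of n-gram occurrences in a corpus that are duplicates; all names and the example data are assumptions for illustration.

```python
# A hedged, illustrative redundancy measure: the share of n-gram
# occurrences that repeat across a set of generated texts. Not the
# paper's exact metric, which is unspecified on this page.
from collections import Counter

def repeated_ngram_fraction(texts, n=3):
    """Fraction of n-gram occurrences that are duplicates corpus-wide."""
    counts = Counter()
    for text in texts:
        toks = text.lower().split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    total = sum(counts.values())
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total if total else 0.0

# Hypothetical persona-driven outputs; heavy template reuse inflates the score.
synthetic = [
    "as a marine biologist, i wonder about coral reefs",
    "as a marine biologist, i wonder about kelp forests",
]
print(f"repeated 3-gram fraction: {repeated_ngram_fraction(synthetic):.2f}")
```

A high value flags template-like repetition that a type-based metric such as TTR can miss, which is why redundancy analysis complements the lexical metrics above.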
Problem

Research questions and friction points this paper is trying to address.

Measuring diversity of synthetic prompts and responses
Comparing synthetic and human-written prompt diversity
Assessing impact of persona detail on text diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Measure diversity with lexical and redundancy metrics
Compare synthetic and human-written prompts
Test the impact of persona detail on diversity