🤖 AI Summary
This study investigates how fine-grained persona (role) prompting affects the diversity of synthetic instructions and responses generated by large language models (LLMs). Method: a systematic, multi-dimensional evaluation that combines lexical diversity metrics (TTR, MTLD) with redundancy analysis, applied in controlled sampling experiments across model scales and levels of persona granularity. Contributions/Results: (1) synthetic instructions exhibit significantly lower lexical diversity than human-authored data; (2) persona prompting consistently improves output diversity, particularly for larger models; (3) further refining persona descriptions (e.g., adding background or personality traits) yields no statistically significant additional gain in diversity. These findings point to diminishing returns between persona-specification granularity and response diversity in persona-driven data synthesis, and they offer empirical evidence and methodological guidance for efficiently constructing high-quality synthetic instruction datasets.
📝 Abstract
Fine-grained personas have recently been used to generate 'diverse' synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven, synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. First, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes using fine-grained and coarse persona descriptions to investigate how much fine-grained detail in a persona description contributes to the diversity of the generated text. We find that while persona prompting does improve lexical diversity (especially with larger models), adding fine-grained detail to personas does not noticeably increase it.
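For readers unfamiliar with the two lexical diversity metrics named above, the sketch below shows one common way to compute TTR and MTLD over a whitespace-tokenized text. This is an illustrative implementation only, not the paper's code; the 0.72 threshold is the value conventionally used for MTLD, and the example strings are invented to show the contrast between a repetitive prompt and a varied one.

```python
def ttr(tokens):
    """Type-Token Ratio: number of unique tokens divided by total tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def _mtld_pass(tokens, threshold=0.72):
    """One directional MTLD pass: count 'factors', i.e. segments whose running
    TTR falls to the threshold, then credit any leftover partial segment."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1
            types, count = set(), 0
    if count > 0:  # partial factor for the remaining segment
        factors += (1 - len(types) / count) / (1 - threshold)
    # MTLD is ill-defined for texts with no repetition at all
    return len(tokens) / factors if factors > 0 else float("inf")


def mtld(tokens, threshold=0.72):
    """Measure of Textual Lexical Diversity: mean of forward and backward passes."""
    if not tokens:
        return 0.0
    return (_mtld_pass(tokens, threshold) + _mtld_pass(tokens[::-1], threshold)) / 2


# Toy comparison: a repetitive, synthetic-looking prompt vs. a more varied one.
synthetic = "write a short story about a cat write a short story about a dog write a short story about a bird".split()
varied = "could you sketch a playful vignette where a stubborn calico outwits the neighborhood dogs at dusk".split()
print(f"TTR  synthetic={ttr(synthetic):.2f}  varied={ttr(varied):.2f}")    # ~0.38 vs ~0.94
print(f"MTLD synthetic={mtld(synthetic):.1f}  varied={mtld(varied):.1f}")  # ~10.5 vs ~72
```

TTR is sensitive to text length (longer texts inevitably repeat words), whereas MTLD is designed to be length-robust by measuring how long the text sustains a high running TTR, which is presumably why the study uses metrics of both kinds rather than either one alone.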