🤖 AI Summary
Existing generative AI systems rely on natural-language personas to guide synthetic data generation, which often induces unintended semantic inferences and leads to imprecise output control. To address this, we propose PILOT, the first framework that maps natural-language personas onto multidimensional, standardized psycholinguistic profiles: schema-based representations enabling measurable, cross-dimensional regulation of both linguistic and psychological attributes. Our method employs a two-stage pipeline: (1) structured persona modeling, followed by (2) persona-driven LLM generation; we further introduce a hybrid guidance strategy that jointly optimizes diversity and structural balance. Experiments on Mistral Large 2, DeepSeek-R1, and LLaMA-3.3-70B demonstrate substantial improvements over baselines: silhouette coefficient increases by 142% (0.098 → 0.237), topic purity rises by 24% (0.773 → 0.957), artificial repetition markedly declines, and expert evaluations confirm no degradation in response quality.
📝 Abstract
Generative AI applications commonly leverage user personas as a steering mechanism for synthetic data generation, but reliance on natural language representations forces models to make unintended inferences about which attributes to emphasize, limiting precise control over outputs. We introduce PILOT (Psychological and Linguistic Output Targeting), a two-phase framework for steering large language models with structured psycholinguistic profiles. In Phase 1, PILOT translates natural language persona descriptions into multidimensional profiles with normalized scores across linguistic and psychological dimensions. In Phase 2, these profiles guide generation along measurable axes of variation. We evaluate PILOT across three state-of-the-art LLMs (Mistral Large 2, DeepSeek-R1, LLaMA 3.3 70B) using 25 synthetic personas under three conditions: Natural-language Persona Steering (NPS), Schema-Based Steering (SBS), and Hybrid Persona-Schema Steering (HPS). Results demonstrate that schema-based approaches significantly reduce artificial-sounding persona repetition while improving output coherence, with silhouette scores increasing from 0.098 to 0.237 and topic purity from 0.773 to 0.957. Our analysis reveals a fundamental trade-off: SBS produces more concise outputs with higher topical consistency, while NPS offers greater lexical diversity but reduced predictability. HPS achieves a balance between these extremes, maintaining output variety while preserving structural consistency. Expert linguistic evaluation confirms that PILOT maintains high response quality across all conditions, with no statistically significant differences between steering approaches.
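To make the two-phase idea concrete, the sketch below shows what a schema-based steering profile and its generation prompt might look like. This is a hypothetical illustration, not the paper's actual schema: the dimension names (`lexical_diversity`, `formality`, etc.) and the prompt wording are invented for this example; only the general shape (normalized scores guiding generation, as in SBS) follows the abstract.

```python
from dataclasses import dataclass

# Hypothetical psycholinguistic profile (Phase 1 output).
# Dimension names are illustrative, not taken from PILOT itself;
# all scores are normalized to the range [0, 1].
@dataclass
class PsycholinguisticProfile:
    lexical_diversity: float
    formality: float
    emotional_valence: float
    sentence_complexity: float

def build_steering_prompt(profile: PsycholinguisticProfile, task: str) -> str:
    """Render an SBS-style steering prefix for an LLM prompt (Phase 2)."""
    dims = "\n".join(
        f"- {name}: {value:.2f}" for name, value in vars(profile).items()
    )
    return (
        "Generate text matching this psycholinguistic profile "
        "(scores normalized to [0, 1]):\n"
        f"{dims}\n\nTask: {task}"
    )

profile = PsycholinguisticProfile(
    lexical_diversity=0.72,
    formality=0.35,
    emotional_valence=0.60,
    sentence_complexity=0.48,
)
prompt = build_steering_prompt(
    profile, "Write a short product review for wireless earbuds."
)
print(prompt)
```

The point of the schema is that each attribute is an explicit, measurable axis, so the same profile can be reused verbatim across models (here, as a prompt prefix) instead of being re-inferred from free-form persona text.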