🤖 AI Summary
Existing large language models excel at generic text generation but struggle to replicate users' personalized linguistic styles, e.g., in email or social media replies, while real-world social interaction data remains inaccessible due to privacy constraints. To address this, we formulate the "Your Next Token Prediction" (YNTP) task and introduce the first multilingual, multi-turn personalized dialogue benchmark spanning Chinese, English, and Japanese, comprising 100 dialogue sessions collected over five consecutive days. We pioneer a psychologically grounded, controllable style-modeling framework by integrating MBTI personality dimensions into synthetic character-based dialogue generation. Methodologically, we combine prompt learning with fine-tuning to explicitly capture individual-level linguistic patterns. This benchmark enables reproducible evaluation of personalized response generation and establishes a foundational resource for user-aligned language modeling, offering both a novel paradigm and empirical infrastructure for advancing stylistically adaptive LLMs.
📝 Abstract
Large language models (LLMs) excel at general next-token prediction but still struggle to generate responses that reflect how individuals truly communicate, such as replying to emails or social messages in their own style. However, real SNS or email histories are difficult to collect due to privacy concerns. To address this, we propose the task of "Your Next Token Prediction (YNTP)", which models a user's precise word choices through controlled human-agent conversations. We build a multilingual benchmark of 100 dialogue sessions across English, Japanese, and Chinese, where users interact for five days with psychologically grounded NPCs based on MBTI dimensions. This setup captures natural, daily-life communication patterns and enables analysis of users' internal models. We evaluate prompt-based and fine-tuning-based personalization methods, establishing the first benchmark for YNTP and a foundation for user-aligned language modeling. The dataset is available at: https://github.com/AnonymousHub4Submissions/your-next-token-prediction-dataset-100
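To make the YNTP task concrete, here is a minimal illustrative sketch (not the authors' evaluation code) of scoring a model by top-1 next-token accuracy against a user's actual reply under teacher forcing. The toy per-user bigram "model" and the helper names (`train_bigram`, `predict_next`, `yntp_accuracy`) are assumptions standing in for a real personalized LLM; the point is only the evaluation shape, not the model.

```python
# Illustrative sketch of Your Next Token Prediction (YNTP) scoring.
# A toy per-user bigram model stands in for a personalized LLM;
# accuracy is measured token-by-token against the user's real reply,
# always conditioning on the gold prefix (teacher forcing).
from collections import defaultdict, Counter

def train_bigram(history):
    """Build per-user bigram counts from past messages (lists of tokens)."""
    counts = defaultdict(Counter)
    for msg in history:
        for prev, nxt in zip(msg, msg[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev_token):
    """Most frequent continuation seen after prev_token, or None if unseen."""
    if counts[prev_token]:
        return counts[prev_token].most_common(1)[0][0]
    return None

def yntp_accuracy(counts, reply):
    """Top-1 next-token accuracy over the user's actual reply."""
    hits = total = 0
    for prev, gold in zip(reply, reply[1:]):
        total += 1
        if predict_next(counts, prev) == gold:
            hits += 1
    return hits / total if total else 0.0

# A user whose dialogue history favors a fixed sign-off phrasing.
history = [["see", "you", "tomorrow"], ["see", "you", "soon"]]
model = train_bigram(history)
print(yntp_accuracy(model, ["see", "you", "tomorrow"]))
```

In the benchmark setting, the bigram model would be replaced by a prompt-based or fine-tuned LLM conditioned on the user's multi-day dialogue history, but the metric, whether the model reproduces the user's exact next token, is the same.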