🤖 AI Summary
To address the bottlenecks of scarce high-quality instruction data, privacy sensitivity, and prohibitive annotation costs in large language model (LLM) post-training, this paper proposes MATRIX, a framework built on a multi-agent social simulation paradigm that generates high-fidelity, diverse textual scenarios. On top of these scenarios, it introduces MATRIX-Gen, a scenario-driven, controllable instruction generator that balances generalizability and domain adaptability in synthetic data production, and it incorporates a reinforcement learning–aware data filtering mechanism to further raise synthetic data quality. Fine-tuned on only 20K MATRIX-synthesized samples, Llama-3-8B-Base outperforms Llama-3-8B-Instruct (trained on over 10 million real-world samples) on AlpacaEval 2 and Arena-Hard, substantially easing the data-dependency and privacy constraints of LLM instruction tuning.
📝 Abstract
Post-training is essential for enabling large language models (LLMs) to follow human instructions. However, its effectiveness depends on high-quality instruction data, which is challenging to obtain in the real world due to privacy concerns, data scarcity, and high annotation costs. To fill this gap, inspired by the recent success of using LLMs to simulate human society, we propose MATRIX, a multi-agent simulator that automatically generates diverse text-based scenarios, capturing a wide range of real-world human needs in a realistic and scalable manner. Leveraging these outputs, we introduce MATRIX-Gen, a novel scenario-driven instruction generator for controllable and highly realistic data synthesis. Extensive experiments demonstrate that our framework effectively generates both general and domain-specific data. On the AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta's Llama-3-8B-Instruct model, which was trained on over 10M pairs.
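The pipeline the abstract describes (simulate scenarios with role-playing agents, then convert each scenario into an instruction) can be sketched at a toy level. The sketch below is purely illustrative and does not reproduce the paper's actual system: the agent roles, the scenario templates, and the `scenario_to_instruction` helper are all hypothetical, and a real implementation would drive both steps with LLM calls rather than string formatting.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    """A simulated person with a role and a goal (hypothetical structure)."""
    role: str
    goal: str


def simulate_scenarios(agents, rounds=2, seed=0):
    """Toy stand-in for the multi-agent simulation: each round, one agent
    acts on its goal, producing a short textual scenario."""
    rng = random.Random(seed)
    scenarios = []
    for _ in range(rounds):
        agent = rng.choice(agents)
        scenarios.append(f"A {agent.role} needs help to {agent.goal}.")
    return scenarios


def scenario_to_instruction(scenario):
    """Toy stand-in for MATRIX-Gen: turn a scenario into an instruction
    prompt; the real generator would condition an LLM on the scenario."""
    return f"Respond to the following real-world request: {scenario}"


agents = [
    Agent("nurse", "draft a patient handover note"),
    Agent("teacher", "plan a geometry lesson"),
]
instructions = [scenario_to_instruction(s) for s in simulate_scenarios(agents)]
```

Each instruction would then be paired with a model-generated response to form the instruction-response pairs used for post-training.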