🤖 AI Summary
Existing synthetic data frameworks often rely on centralized orchestrators, which limit scalability, or on domain-specific, hard-coded designs, which compromise generality. To address these limitations, we propose the first decentralized multi-agent synthetic data generation framework. Our approach eliminates the central scheduler in favor of a peer-to-peer architecture built on Ray's distributed message queue, decoupling control flow from data flow. Lightweight agent coordination and service-oriented computation offloading (including LLM inference and containerized execution environments) enable modular configuration and parallel execution of heterogeneous tasks. Experiments demonstrate that, under identical hardware constraints, our framework achieves 2–15× higher throughput without sacrificing output quality. Extensive evaluation across diverse synthetic data generation scenarios further confirms its scalability, flexibility, and generality.
📝 Abstract
Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hard-coded for specific domains, limiting flexibility. We present **Matrix**, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves 2–15× higher data generation throughput under identical hardware resources, without compromising output quality.
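To make the abstract's core idea concrete, here is a minimal sketch of the pattern it describes: each task travels as a serialized message carrying both its control state (which step it is on) and its data, and stateless peer workers pull messages, advance them one step, and re-enqueue them, so no central orchestrator tracks per-task progress. This is an illustrative stand-in only; Matrix builds this on Ray's distributed queues and real LLM services, whereas this sketch uses Python's `queue.Queue`, threads, and a placeholder `fake_llm` function (all names here are hypothetical, not Matrix's API).

```python
import json
import queue
import threading

task_queue = queue.Queue()   # shared message queue (stand-in for Ray's distributed queue)
results = queue.Ueue() if False else queue.Queue()  # completed workflows

STAGES = ["draft", "critique", "revise"]  # hypothetical agent roles in one workflow


def fake_llm(stage, history):
    # Placeholder for a compute-intensive call offloaded to a distributed
    # LLM inference service in the real system.
    return history + [stage]


def worker():
    # A lightweight peer: no global view, just pull-advance-reenqueue.
    while True:
        msg = task_queue.get()
        if msg is None:                      # shutdown sentinel
            task_queue.task_done()
            break
        task = json.loads(msg)               # control + data travel together
        stage = STAGES[task["step"]]
        task["history"] = fake_llm(stage, task["history"])
        task["step"] += 1
        if task["step"] < len(STAGES):
            task_queue.put(json.dumps(task))  # next hop, any peer may pick it up
        else:
            results.put(task)                 # workflow finished
        task_queue.task_done()


workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

# Launch eight independent agentic workflows; each progresses on its own.
for i in range(8):
    task_queue.put(json.dumps({"id": i, "step": 0, "history": []}))

task_queue.join()                 # wait until every message has been processed
for _ in workers:
    task_queue.put(None)          # one sentinel per worker
for w in workers:
    w.join()
```

Because control state rides inside each message, tasks at different stages interleave freely across workers, which is what lets this style of design parallelize heterogeneous workflows without a scheduler bottleneck.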