🤖 AI Summary
Existing synthetic data frameworks often rely on centralized orchestrators, which limit scalability, or on domain-specific, hard-coded designs, which compromise generality. To address these limitations, we propose the first decentralized multi-agent synthetic data generation framework. Our approach eliminates the central scheduler in favor of a peer-to-peer architecture built on Ray's distributed message queue, decoupling control flow from data flow. Lightweight agent coordination and service-oriented computation offloading (including LLM inference and containerized execution environments) enable modular configuration and parallel execution of heterogeneous tasks. Experiments demonstrate that, under identical hardware constraints, our framework achieves 2–15× higher throughput without sacrificing output quality. Extensive evaluation across diverse synthetic data generation scenarios further confirms its scalability, flexibility, and generality.
📝 Abstract
Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hard-coded for specific domains, limiting flexibility. We present **Matrix**, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves 2–15× higher data generation throughput under identical hardware resources, without compromising output quality.
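To make the abstract's core idea concrete, here is a minimal sketch of the pattern it describes: each task travels as a serialized message carrying both its control state (which step it is on) and its data, and stateless peer workers pull messages, advance them one step, and re-enqueue them, so no central orchestrator tracks per-task progress. This is an illustrative stand-in only; Matrix builds this on Ray's distributed queues and real LLM services, whereas this sketch uses Python's `queue.Queue`, threads, and a placeholder `fake_llm` function (all names here are hypothetical, not Matrix's API).

```python
import json
import queue
import threading

task_queue = queue.Queue()   # shared message queue (stand-in for Ray's distributed queue)
results = queue.Ueue() if False else queue.Queue()  # completed workflows

STAGES = ["draft", "critique", "revise"]  # hypothetical agent roles in one workflow


def fake_llm(stage, history):
    # Placeholder for a compute-intensive call offloaded to a distributed
    # LLM inference service in the real system.
    return history + [stage]


def worker():
    # A lightweight peer: no global view, just pull-advance-reenqueue.
    while True:
        msg = task_queue.get()
        if msg is None:                      # shutdown sentinel
            task_queue.task_done()
            break
        task = json.loads(msg)               # control + data travel together
        stage = STAGES[task["step"]]
        task["history"] = fake_llm(stage, task["history"])
        task["step"] += 1
        if task["step"] < len(STAGES):
            task_queue.put(json.dumps(task))  # next hop, any peer may pick it up
        else:
            results.put(task)                 # workflow finished
        task_queue.task_done()


workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

# Launch eight independent agentic workflows; each progresses on its own.
for i in range(8):
    task_queue.put(json.dumps({"id": i, "step": 0, "history": []}))

task_queue.join()                 # wait until every message has been processed
for _ in workers:
    task_queue.put(None)          # one sentinel per worker
for w in workers:
    w.join()
```

Because control state rides inside each message, tasks at different stages interleave freely across workers, which is what lets this style of design parallelize heterogeneous workflows without a scheduler bottleneck.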