Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing synthetic data frameworks often rely on centralized orchestrators, which limit scalability, or on domain-specific, hard-coded designs, which compromise generality. To address these limitations, we propose the first decentralized multi-agent synthetic data generation framework. Our approach eliminates the central scheduler and instead adopts a peer-to-peer architecture built on Ray’s distributed message queue, decoupling control flow from data flow. Lightweight agent coordination and service-oriented computation offloading (including LLM inference and containerized execution environments) enable modular configuration and parallel execution of heterogeneous tasks. Experiments demonstrate that, under identical hardware constraints, our framework achieves 2–15× higher throughput without sacrificing output quality. Furthermore, extensive evaluation across diverse synthetic data generation scenarios confirms its superior scalability, flexibility, and generality.

📝 Abstract
Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present Matrix, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves 2–15× higher data generation throughput under identical hardware resources, without compromising output quality.
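The core pattern the abstract describes is that tasks advance by passing serialized messages through shared queues, with each lightweight agent pulling and forwarding work independently rather than being driven by a central orchestrator. Below is a minimal single-machine sketch of that pattern using Python's stdlib `queue` and `threading` as a stand-in for Ray's distributed queue; the agent roles ("drafter", "reviewer") and message fields are illustrative assumptions, not Matrix's actual API.

```python
# Sketch of peer-to-peer agent coordination via message queues (no
# central scheduler): each agent blocks on its input queue, performs
# its step, and forwards the serialized task message downstream.
import json
import queue
import threading

draft_q: "queue.Queue[str]" = queue.Queue()   # tasks awaiting the drafter agent
review_q: "queue.Queue[str]" = queue.Queue()  # tasks awaiting the reviewer agent
done_q: "queue.Queue[str]" = queue.Queue()    # finished tasks

def drafter() -> None:
    # Lightweight agent: pulls a task message, does its step, forwards it.
    while True:
        msg = json.loads(draft_q.get())
        if msg.get("stop"):
            break
        msg["draft"] = f"draft for {msg['topic']}"
        review_q.put(json.dumps(msg))

def reviewer() -> None:
    while True:
        msg = json.loads(review_q.get())
        if msg.get("stop"):
            break
        msg["review"] = "ok"
        done_q.put(json.dumps(msg))

threads = [threading.Thread(target=drafter), threading.Thread(target=reviewer)]
for t in threads:
    t.start()

# Submit two independent tasks; each progresses through the agents on its own.
for topic in ["dialogue", "tool-use"]:
    draft_q.put(json.dumps({"topic": topic}))

results = [json.loads(done_q.get()) for _ in range(2)]

# Shut the agents down with sentinel messages.
draft_q.put(json.dumps({"stop": True}))
review_q.put(json.dumps({"stop": True}))
for t in threads:
    t.join()

print(sorted(r["topic"] for r in results))  # prints ['dialogue', 'tool-use']
```

In the real framework the queues are distributed (Ray) and the heavy steps (LLM inference, containerized execution) are offloaded to shared services, so adding throughput means adding more agent workers pulling from the same queues.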
Problem

Research questions and friction points this paper is trying to address.

Centralized orchestrators in existing multi-agent synthesis frameworks create scalability bottlenecks
Domain-specific, hard-coded designs limit flexibility across data synthesis scenarios
Throughput must be raised substantially without compromising output quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decentralized peer-to-peer framework eliminates central orchestrator
Lightweight agents and distributed services handle tasks independently
Modular design scales to tens of thousands of concurrent workflows
Dong Wang, FAIR at Meta
Yang Li, FAIR at Meta
Ansong Ni, PhD Student, Yale University (Machine Learning, Natural Language Processing, Software Engineering)
Ching-Feng Yeh, FAIR at Meta
Youssef Emad, FAIR at Meta
Xinjie Lei, FAIR at Meta
Liam Robbins, FAIR at Meta
Karthik Padthe, FAIR at Meta
Hu Xu, FAIR at Meta
Xian Li, FAIR at Meta
Asli Celikyilmaz, Researcher @ FAIR at Meta (Deep Learning, Natural Language Processing)
Ramya Raghavendra, IBM TJ Watson Research Center
Lifei Huang, FAIR at Meta
Carole-Jean Wu, Meta AI / FAIR (Machine Learning Systems, Computer Architecture, Memory Subsystem Design, Energy, Sustainability)
Shang-Wen Li, FAIR at Meta