Nimbus: A Unified Embodied Synthetic Data Generation Framework

📅 2026-01-29
🤖 AI Summary
Existing synthetic data generation pipelines for embodied intelligence are fragmented and task-specific, hindering the high-throughput and stable data production required for large-scale foundation model training. This work proposes a unified synthetic data generation framework featuring a modular, four-layer architecture and a dynamic scheduling mechanism that integrates heterogeneous navigation and manipulation task pipelines. The system enables asynchronous coordination among trajectory planning, rendering, and storage, while incorporating global load balancing, distributed fault tolerance, and customized rendering optimizations to efficiently orchestrate CPU, GPU, and I/O resources. Experiments demonstrate a 2–3× improvement in end-to-end throughput over baseline systems, achieving—for the first time—cross-task, high-throughput, and long-duration stable synthetic data generation for embodied agents, thereby enabling seamless operation of the InternData suite in large-scale distributed environments.

📝 Abstract
Scaling data volume and diversity is critical for generalizing embodied intelligence. While synthetic data generation offers a scalable alternative to expensive physical data acquisition, existing pipelines remain fragmented and task-specific. This isolation leads to significant engineering inefficiency and system instability, and it fails to support the sustained, high-throughput data generation required for foundation model training. To address these challenges, we present Nimbus, a unified synthetic data generation framework designed to integrate heterogeneous navigation and manipulation pipelines. Nimbus introduces a modular four-layer architecture featuring a decoupled execution model that separates trajectory planning, rendering, and storage into asynchronous stages. By implementing dynamic pipeline scheduling, global load balancing, distributed fault tolerance, and backend-specific rendering optimizations, the system maximizes utilization of CPU, GPU, and I/O resources. Our evaluation demonstrates that Nimbus achieves a 2-3× improvement in end-to-end throughput over unoptimized baselines and ensures robust, long-term operation in large-scale distributed environments. This framework serves as the production backbone for the InternData suite, enabling seamless cross-domain data synthesis.
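The decoupled execution model described in the abstract can be pictured as a chain of stages connected by bounded queues, so that planning, rendering, and storage overlap instead of running one after another. The following is a minimal sketch of that general pattern using only the Python standard library. It is not Nimbus's actual API: the names `stage`, `plan`, `render`, and `run_pipeline` are hypothetical, and the planning and rendering bodies are trivial stand-ins.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker; real task ids must not be None


def stage(worker, inbox, outbox):
    """Consume items from inbox, apply worker, and forward results to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            if outbox is not None:
                outbox.put(SENTINEL)  # propagate shutdown to the next stage
            break
        result = worker(item)
        if outbox is not None:
            outbox.put(result)


def plan(task_id):
    # Hypothetical stand-in for CPU-bound trajectory planning.
    return {"task": task_id, "trajectory": [task_id, task_id + 1]}


def render(job):
    # Hypothetical stand-in for GPU-bound rendering.
    return {**job, "frames": len(job["trajectory"])}


def run_pipeline(task_ids, queue_size=4):
    """Run plan -> render -> store as three concurrently running stages."""
    stored = []  # stand-in for the storage backend
    tasks = queue.Queue(maxsize=queue_size)     # bounded queues apply backpressure
    planned = queue.Queue(maxsize=queue_size)
    rendered = queue.Queue(maxsize=queue_size)
    threads = [
        threading.Thread(target=stage, args=(plan, tasks, planned)),
        threading.Thread(target=stage, args=(render, planned, rendered)),
        threading.Thread(target=stage, args=(stored.append, rendered, None)),
    ]
    for t in threads:
        t.start()
    for task_id in task_ids:
        tasks.put(task_id)
    tasks.put(SENTINEL)
    for t in threads:
        t.join()
    return stored
```

Calling `run_pipeline(range(5))` returns five stored records in submission order, because each stage here has a single worker. The bounded queues are the design choice worth noting: a slow stage blocks its producer rather than letting unprocessed work pile up in memory. Scheduling, load balancing, and fault tolerance as described in the abstract are outside the scope of this sketch.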
Problem

Research questions and friction points this paper is trying to address.

synthetic data generation
embodied intelligence
foundation model training
data scalability
pipeline fragmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified synthetic data generation
modular architecture
asynchronous pipeline execution
distributed fault tolerance
embodied intelligence
Authors

Zeyu He
Ph.D. Student, Penn State University
Natural Language Processing, HCI, Crowdsourcing
Yuchang Zhang
Shanghai Artificial Intelligence Laboratory
Yuanzhen Zhou
Shanghai Artificial Intelligence Laboratory
Miao Tao
Shanghai Artificial Intelligence Laboratory
Hengjie Li
Shanghai Artificial Intelligence Laboratory, Shanghai Innovation Institute
Yang Tian
Shanghai Artificial Intelligence Laboratory
Jia Zeng
Shanghai AI Laboratory
Embodied AI, Robotic Manipulation, Vision-Language-Action
Tai Wang
Shanghai AI Laboratory
Computer Vision, 3D Vision, Embodied AI, Deep Learning
Wen-Bin Cai
Shanghai Artificial Intelligence Laboratory
Yilun Chen
Shanghai AI Laboratory
Autonomous Driving, Embodied AI, Computer Vision
Ning Gao
Shanghai Artificial Intelligence Laboratory
Jiangmiao Pang
Shanghai Artificial Intelligence Laboratory