🤖 AI Summary
Existing synthetic data generation pipelines for embodied intelligence are fragmented and task-specific, which hinders the stable, high-throughput data production required for large-scale foundation model training. This work proposes a unified synthetic data generation framework with a modular four-layer architecture and a dynamic scheduling mechanism that integrates heterogeneous navigation and manipulation task pipelines. The system coordinates trajectory planning, rendering, and storage asynchronously, and uses global load balancing, distributed fault tolerance, and customized rendering optimizations to orchestrate CPU, GPU, and I/O resources efficiently. Experiments show a 2–3× improvement in end-to-end throughput over unoptimized baselines, achieving, for the first time, cross-task, high-throughput, and long-duration stable synthetic data generation for embodied agents, and enabling the InternData suite to run in large-scale distributed environments.
📝 Abstract
Scaling data volume and diversity is critical for generalizing embodied intelligence. While synthetic data generation offers a scalable alternative to expensive physical data acquisition, existing pipelines remain fragmented and task-specific. This isolation leads to significant engineering inefficiency and system instability, so these pipelines fail to support the sustained, high-throughput data generation required for foundation model training. To address these challenges, we present Nimbus, a unified synthetic data generation framework designed to integrate heterogeneous navigation and manipulation pipelines. Nimbus introduces a modular four-layer architecture featuring a decoupled execution model that separates trajectory planning, rendering, and storage into asynchronous stages. By implementing dynamic pipeline scheduling, global load balancing, distributed fault tolerance, and backend-specific rendering optimizations, the system maximizes utilization of CPU, GPU, and I/O resources. Our evaluation demonstrates that Nimbus achieves a 2–3× improvement in end-to-end throughput compared to unoptimized baselines while ensuring robust, long-term operation in large-scale distributed environments. This framework serves as the production backbone for the InternData suite, enabling seamless cross-domain data synthesis.
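To make the decoupled execution model concrete, the following is a minimal sketch of three asynchronous stages connected by bounded queues. It is not the Nimbus implementation: the function names (`plan`, `render`, `run_pipeline`), the placeholder payloads, and the one-thread-per-stage layout are all assumptions chosen for illustration. The point is only that each stage runs independently, so a slow renderer exerts backpressure on planning through the queue instead of blocking the whole pipeline in lockstep.

```python
import queue
import threading

_STOP = object()  # sentinel that marks the end of the task stream


def run_stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward results until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is _STOP:
            if outbox is not None:
                outbox.put(_STOP)  # propagate shutdown to the next stage
            break
        result = fn(item)
        if outbox is not None:
            outbox.put(result)


def plan(task_id):
    # Hypothetical stand-in for CPU-bound trajectory planning.
    return {"task": task_id, "trajectory": [task_id, task_id + 1]}


def render(planned):
    # Hypothetical stand-in for GPU-bound rendering.
    return {**planned, "frames": len(planned["trajectory"])}


def run_pipeline(task_ids, queue_size=4):
    """Run planning, rendering, and storage as independent stages."""
    stored = []

    def store(rendered):
        # Hypothetical stand-in for I/O-bound storage.
        stored.append(rendered)

    # Bounded queues let a slow stage throttle the stage before it.
    tasks = queue.Queue(maxsize=queue_size)
    planned = queue.Queue(maxsize=queue_size)
    rendered = queue.Queue(maxsize=queue_size)

    workers = [
        threading.Thread(target=run_stage, args=(plan, tasks, planned)),
        threading.Thread(target=run_stage, args=(render, planned, rendered)),
        threading.Thread(target=run_stage, args=(store, rendered, None)),
    ]
    for worker in workers:
        worker.start()
    for task_id in task_ids:
        tasks.put(task_id)
    tasks.put(_STOP)
    for worker in workers:
        worker.join()
    return stored


if __name__ == "__main__":
    for record in run_pipeline(range(3)):
        print(record)
```

A production system would presumably run several workers per stage across processes or machines and add the scheduling, load balancing, and fault tolerance the abstract describes; this sketch omits all of that and shows only the stage separation.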