Nimbus: A Unified Embodied Synthetic Data Generation Framework

📅 2026-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing synthetic data generation pipelines for embodied intelligence are fragmented and task-specific, which hinders the stable, high-throughput data production required for large-scale foundation model training. This work proposes a unified synthetic data generation framework with a modular four-layer architecture and a dynamic scheduling mechanism that integrates heterogeneous navigation and manipulation task pipelines. The system coordinates trajectory planning, rendering, and storage asynchronously, and uses global load balancing, distributed fault tolerance, and customized rendering optimizations to orchestrate CPU, GPU, and I/O resources efficiently. Experiments demonstrate a 2–3× improvement in end-to-end throughput over unoptimized baselines, achieving, for the first time, cross-task, high-throughput, and long-duration stable synthetic data generation for embodied agents. This enables the InternData suite to operate in large-scale distributed environments.

📝 Abstract

Scaling data volume and diversity is critical for generalizing embodied intelligence. While synthetic data generation offers a scalable alternative to expensive physical data acquisition, existing pipelines remain fragmented and task-specific. This isolation leads to significant engineering inefficiency and system instability, and fails to support the sustained, high-throughput data generation required for foundation model training. To address these challenges, we present Nimbus, a unified synthetic data generation framework designed to integrate heterogeneous navigation and manipulation pipelines. Nimbus introduces a modular four-layer architecture featuring a decoupled execution model that separates trajectory planning, rendering, and storage into asynchronous stages. By implementing dynamic pipeline scheduling, global load balancing, distributed fault tolerance, and backend-specific rendering optimizations, the system maximizes utilization of CPU, GPU, and I/O resources. Our evaluation demonstrates that Nimbus achieves a 2–3× improvement in end-to-end throughput compared to unoptimized baselines and ensures robust, long-term operation in large-scale distributed environments. This framework serves as the production backbone for the InternData suite, enabling seamless cross-domain data synthesis.
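
The decoupled execution model described in the abstract can be illustrated with a small sketch. This is not the Nimbus implementation; it is a minimal example, assuming three stages connected by bounded queues, of how planning, rendering, and storage might run as asynchronous stages so that a slow stage applies backpressure instead of blocking the whole pipeline. All names here (`plan`, `render`, `store`, `run_pipeline`, `queue_size`) are hypothetical, and the `asyncio.sleep(0)` calls stand in for real CPU, GPU, and I/O work.

```python
import asyncio

SENTINEL = None  # hypothetical end-of-stream marker passed between stages


async def plan(num_episodes, out_q):
    """Stage 1 (stand-in for CPU-side trajectory planning)."""
    for episode_id in range(num_episodes):
        await asyncio.sleep(0)  # placeholder for real planning work
        trajectory = [episode_id, episode_id + 1]
        await out_q.put({"episode": episode_id, "trajectory": trajectory})
    await out_q.put(SENTINEL)


async def render(in_q, out_q):
    """Stage 2 (stand-in for GPU-side rendering)."""
    while True:
        item = await in_q.get()
        if item is SENTINEL:
            await out_q.put(SENTINEL)  # propagate shutdown downstream
            break
        await asyncio.sleep(0)  # placeholder for real rendering work
        item["frames"] = [f"frame_{step}" for step in item["trajectory"]]
        await out_q.put(item)


async def store(in_q, sink):
    """Stage 3 (stand-in for I/O-bound storage)."""
    while True:
        item = await in_q.get()
        if item is SENTINEL:
            break
        sink.append(item)  # placeholder for writing to disk or object storage


async def run_pipeline(num_episodes, queue_size=2):
    # Bounded queues decouple the stages while limiting in-flight work.
    planned = asyncio.Queue(maxsize=queue_size)
    rendered = asyncio.Queue(maxsize=queue_size)
    sink = []
    await asyncio.gather(
        plan(num_episodes, planned),
        render(planned, rendered),
        store(rendered, sink),
    )
    return sink


if __name__ == "__main__":
    results = asyncio.run(run_pipeline(4))
    print(len(results), "episodes stored")
```

In this sketch each stage only waits on its own queue, so planning for the next episode can proceed while an earlier episode is still being rendered or stored. A production system of the kind the abstract describes would additionally need scheduling, load balancing, and fault tolerance across processes and machines, none of which this single-process example attempts.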

Problem

Research questions and friction points this paper is trying to address.

synthetic data generation

embodied intelligence

foundation model training

data scalability

pipeline fragmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

unified synthetic data generation

modular architecture

asynchronous pipeline execution