AI Summary
This work addresses the limitations of existing scene generation methods, which often fail to balance semantic richness and physical feasibility due to insufficient contextual awareness, leading to robotic task failures from unreachable goals. To overcome this, the authors propose an embodied agent framework that leverages foundation models to bridge high-level semantic reasoning with low-level physical interaction, enabling end-to-end autonomous data synthesis. Key innovations include context-aware scene construction guided by image inpainting, a vision-language model-based closed-loop verification mechanism to filter out silent failures, and a perception-driven video compression algorithm. The approach achieves over 90% data compression without compromising the training performance of downstream vision-language-action (VLA) models, substantially enhancing the scalability and efficiency of generating high-quality robotic manipulation datasets.
Abstract
Scaling Vision-Language-Action (VLA) models requires massive datasets that are both semantically coherent and physically feasible. However, existing scene generation methods often lack context awareness, making it difficult to synthesize high-fidelity environments embedded with rich semantic information, frequently resulting in unreachable target positions that cause tasks to fail prematurely. We present V-CAGE (Vision-Closed-loop Agentic Generation Engine), an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE operates as an embodied agentic system, leveraging foundation models to bridge high-level semantic reasoning with low-level physical interaction. Specifically, we introduce Inpainting-Guided Scene Construction to systematically arrange context-aware layouts, ensuring that the generated scenes are both semantically structured and kinematically reachable. To ensure trajectory correctness, we integrate functional metadata with a Vision-Language Model (VLM)-based closed-loop verification mechanism that acts as a visual critic, rigorously filtering out silent failures and severing the error propagation chain. Finally, to overcome the storage bottleneck of massive video datasets, we implement a perceptually driven compression algorithm that achieves over 90% file-size reduction without compromising downstream VLA training efficacy. By centralizing semantic layout planning and visual self-verification, V-CAGE automates the end-to-end pipeline, enabling the highly scalable synthesis of diverse, high-quality robotic manipulation datasets.
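The closed-loop verification idea described above can be illustrated with a minimal sketch. All function and field names here are hypothetical placeholders (the paper does not publish an API): a real critic would query a vision-language model with an episode's rendered frames and task description, whereas this stub simulates that verdict with boolean flags.

```python
# Hedged sketch of VLM-based closed-loop episode filtering.
# Every identifier below is illustrative, not V-CAGE's actual interface.

def vlm_critic_approves(episode: dict) -> bool:
    """Placeholder for the visual critic.

    A real implementation would send the episode's final frames plus the
    task instruction to a vision-language model and parse its yes/no
    verdict; here we stand in with precomputed boolean checks.
    """
    return episode["goal_visible"] and episode["object_at_target"]

def filter_silent_failures(episodes: list[dict]) -> list[dict]:
    """Keep only episodes the critic verifies, so trajectories that
    'succeeded' by the scripted logger but failed visually (silent
    failures) never reach VLA training."""
    return [ep for ep in episodes if vlm_critic_approves(ep)]

episodes = [
    {"id": 0, "goal_visible": True,  "object_at_target": True},   # genuine success
    {"id": 1, "goal_visible": True,  "object_at_target": False},  # silent failure
    {"id": 2, "goal_visible": False, "object_at_target": True},   # occluded goal
]
verified = filter_silent_failures(episodes)
print([ep["id"] for ep in verified])  # -> [0]
```

The design point is that verification happens *after* rendering, on the same visual observations a downstream policy would see, which is what lets it catch failures that state-based success checks miss.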