AI Summary
This work addresses the limitations of existing scene generation methods, which often fail to balance semantic richness and physical feasibility due to insufficient contextual awareness, leading to robotic task failures from unreachable goals. To overcome this, the authors propose an embodied agent framework that leverages foundation models to bridge high-level semantic reasoning with low-level physical interaction, enabling end-to-end autonomous data synthesis. Key innovations include context-aware scene construction guided by image inpainting, a vision-language model-based closed-loop verification mechanism to filter out silent failures, and a perception-driven video compression algorithm. The approach achieves over 90% data compression without compromising the training performance of downstream vision-language-action (VLA) models, substantially enhancing the scalability and efficiency of generating high-quality robotic manipulation datasets.
Abstract
Scaling Vision-Language-Action (VLA) models requires massive datasets that are both semantically coherent and physically feasible. However, existing scene generation methods often lack context awareness, making it difficult to synthesize high-fidelity environments embedded with rich semantic information, frequently resulting in unreachable target positions that cause tasks to fail prematurely. We present V-CAGE (Vision-Closed-loop Agentic Generation Engine), an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE operates as an embodied agentic system, leveraging foundation models to bridge high-level semantic reasoning with low-level physical interaction. Specifically, we introduce Inpainting-Guided Scene Construction to systematically arrange context-aware layouts, ensuring that the generated scenes are both semantically structured and kinematically reachable. To ensure trajectory correctness, we integrate functional metadata with a Vision-Language Model (VLM)-based closed-loop verification mechanism that acts as a visual critic, rigorously filtering out silent failures and severing the error propagation chain. Finally, to overcome the storage bottleneck of massive video datasets, we implement a perceptually driven compression algorithm that achieves over 90% file-size reduction without compromising downstream VLA training efficacy. By centralizing semantic layout planning and visual self-verification, V-CAGE automates the end-to-end pipeline, enabling the highly scalable synthesis of diverse, high-quality robotic manipulation datasets.
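The closed-loop verification idea described above can be illustrated with a minimal sketch. All function and field names here are hypothetical placeholders (the paper does not publish an API): a real critic would query a vision-language model with an episode's rendered frames and task description, whereas this stub simulates that verdict with boolean flags.

```python
# Hedged sketch of VLM-based closed-loop episode filtering.
# Every identifier below is illustrative, not V-CAGE's actual interface.

def vlm_critic_approves(episode: dict) -> bool:
    """Placeholder for the visual critic.

    A real implementation would send the episode's final frames plus the
    task instruction to a vision-language model and parse its yes/no
    verdict; here we stand in with precomputed boolean checks.
    """
    return episode["goal_visible"] and episode["object_at_target"]

def filter_silent_failures(episodes: list[dict]) -> list[dict]:
    """Keep only episodes the critic verifies, so trajectories that
    'succeeded' by the scripted logger but failed visually (silent
    failures) never reach VLA training."""
    return [ep for ep in episodes if vlm_critic_approves(ep)]

episodes = [
    {"id": 0, "goal_visible": True,  "object_at_target": True},   # genuine success
    {"id": 1, "goal_visible": True,  "object_at_target": False},  # silent failure
    {"id": 2, "goal_visible": False, "object_at_target": True},   # occluded goal
]
verified = filter_silent_failures(episodes)
print([ep["id"] for ep in verified])  # -> [0]
```

The design point is that verification happens *after* rendering, on the same visual observations a downstream policy would see, which is what lets it catch failures that state-based success checks miss.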