V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks

๐Ÿ“… 2026-01-21
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
This work addresses key challenges in learning long-horizon embodied behaviors from synthetic data: physically implausible scenes, semantic inconsistency between generated programs and task goals, and the difficulty of grounding high-level instructions into executable actions. To this end, the authors propose V-CAGE, a closed-loop framework that enforces geometric consistency through context-aware scene instantiation, maps high-level goals to composable action primitives via hierarchical instruction decomposition, and incorporates a vision-language model (VLM)-driven verification loop for semantic validation. By combining dynamic forbidden-region mapping with a rejection sampling mechanism, V-CAGE substantially improves the physical and semantic fidelity of the synthesized data. The resulting dataset significantly improves both the success rate and the generalization of downstream embodied policies.
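The "dynamic forbidden-region mapping with rejection sampling" idea can be sketched in a few lines. This is a minimal illustration under assumptions of our own: objects are reduced to axis-aligned 2D footprints, and the function and variable names (`overlaps`, `place_objects`) are hypothetical, not the paper's implementation.

```python
import random

def overlaps(a, b):
    """True if two axis-aligned boxes (x_min, y_min, x_max, y_max) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def place_objects(footprints, workspace=(0.0, 0.0, 1.0, 1.0), max_tries=1000, seed=0):
    """Place each (w, h) footprint in the workspace without interpenetration.

    Maintains a growing map of forbidden regions (the footprints already
    claimed) and rejection-samples candidate poses against it.
    """
    rng = random.Random(seed)
    x0, y0, x1, y1 = workspace
    forbidden = []  # dynamically updated map of claimed regions
    poses = []
    for w, h in footprints:
        for _ in range(max_tries):
            x = rng.uniform(x0, x1 - w)
            y = rng.uniform(y0, y1 - h)
            box = (x, y, x + w, y + h)
            if not any(overlaps(box, f) for f in forbidden):
                forbidden.append(box)  # claim this region for the new object
                poses.append((x, y))
                break
        else:
            raise RuntimeError("no conflict-free pose found; scene too cluttered")
    return poses
```

Each accepted placement shrinks the feasible space for later objects, which is what keeps cluttered scenes conflict-free by construction rather than by post-hoc repair.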

๐Ÿ“ Abstract
Learning long-horizon embodied behaviors from synthetic data remains challenging because generated scenes are often physically implausible, language-driven programs frequently "succeed" without satisfying task semantics, and high-level instructions require grounding into executable action sequences. To address these limitations, we introduce V-CAGE, a closed-loop framework for generating robust, semantically aligned manipulation datasets at scale. First, we propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. By dynamically maintaining a map of prohibited spatial areas as objects are placed, our system prevents interpenetration and ensures reachable, conflict-free configurations in cluttered environments. Second, to bridge the gap between abstract intent and low-level control, we employ a hierarchical instruction decomposition module. This decomposes high-level goals (e.g., "get ready for work") into compositional action primitives, facilitating coherent long-horizon planning. Crucially, we enforce semantic correctness through a VLM-based verification loop. Acting as a visual critic, the VLM performs rigorous rejection sampling after each subtask, filtering out "silent failures" where code executes but fails to achieve the visual goal. Experiments demonstrate that V-CAGE yields datasets with superior physical and semantic fidelity, significantly boosting the success rate and generalization of downstream policies compared to non-verified baselines.
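The per-subtask verification loop described in the abstract can be sketched as follows. This is an illustrative outline only: `execute`, `render`, and `vlm_judge` stand in for the generated subtask code, the simulator's camera, and the VLM critic, and we assume failed attempts can simply be re-run (e.g., from a simulator checkpoint); none of these names come from the paper.

```python
def run_with_verification(subtasks, execute, render, vlm_judge, max_retries=3):
    """Execute subtasks in order, keeping only visually verified attempts.

    After each subtask's code runs, a VLM critic judges a rendered frame
    against the subtask description; unverified attempts are rejected and
    retried, filtering out "silent failures" where code runs but the
    visual goal is not achieved.
    """
    trajectory = []
    for subtask in subtasks:
        for _attempt in range(max_retries):
            actions = execute(subtask)     # run generated code for this subtask
            frame = render()               # capture the resulting scene
            if vlm_judge(frame, subtask):  # visual critic: did it really happen?
                trajectory.append((subtask, actions))
                break
        else:
            return None  # every attempt was a silent failure; reject the episode
    return trajectory
```

Because verification gates each subtask rather than only the final state, errors cannot silently propagate through a long-horizon sequence.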
Problem

Research questions and friction points this paper is trying to address.

long-horizon embodied tasks
physically implausible scenes
task semantics
instruction grounding
synthetic data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

context-aware generation
hierarchical instruction decomposition
VLM-based verification
long-horizon embodied tasks
semantic fidelity