AI Summary
This work addresses key challenges in learning long-horizon embodied behaviors from synthetic data: physical implausibility, semantic inconsistency, and the difficulty of translating high-level instructions into executable actions. To this end, the authors propose V-CAGE, a closed-loop framework that enforces geometric consistency through context-aware scene instantiation, maps high-level goals to composable action primitives via hierarchical instruction decomposition, and incorporates a vision-language model (VLM)-driven verification loop for semantic validation. By integrating dynamic forbidden-region mapping with a rejection-sampling mechanism, V-CAGE substantially improves the physical and semantic fidelity of the synthesized data. The resulting dataset significantly improves both the success rate and generalization capability of downstream embodied policies.
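The VLM-driven verification loop described above amounts to rejection sampling over subtask executions: run the generated program for each subtask, ask a visual critic whether the result matches the goal, and retry or reject the episode on failure. A minimal sketch follows; `execute` and `verify` are hypothetical placeholders standing in for the paper's actual simulator and VLM interfaces, which are not specified here.

```python
def run_with_verification(subtasks, execute, verify, max_attempts=3):
    """Execute each subtask and accept it only if the visual critic approves.

    execute(task) -> observation: runs the generated program (placeholder).
    verify(task, observation) -> bool: VLM-as-critic judgment (placeholder).
    Returns the accepted (task, observation) trace, or None if any subtask
    exhausts its attempts -- the 'rejection' branch that filters out
    silent failures where code runs but the visual goal is not met.
    """
    trace = []
    for task in subtasks:
        for _attempt in range(max_attempts):
            obs = execute(task)          # run the candidate program
            if verify(task, obs):        # semantic check on the outcome
                trace.append((task, obs))
                break                    # subtask accepted, move on
        else:
            return None                  # reject the whole episode
    return trace
```

The retry budget (`max_attempts`) trades generation cost against yield: a larger budget salvages more episodes, while rejected episodes never enter the dataset.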
Abstract
Learning long-horizon embodied behaviors from synthetic data remains challenging because generated scenes are often physically implausible, language-driven programs frequently "succeed" without satisfying task semantics, and high-level instructions require grounding into executable action sequences. To address these limitations, we introduce V-CAGE, a closed-loop framework for generating robust, semantically aligned manipulation datasets at scale. First, we propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. By dynamically maintaining a map of prohibited spatial areas as objects are placed, our system prevents interpenetration and ensures reachable, conflict-free configurations in cluttered environments. Second, to bridge the gap between abstract intent and low-level control, we employ a hierarchical instruction decomposition module. This decomposes high-level goals (e.g., "get ready for work") into compositional action primitives, facilitating coherent long-horizon planning. Crucially, we enforce semantic correctness through a VLM-based verification loop. Acting as a visual critic, the VLM performs rigorous rejection sampling after each subtask, filtering out "silent failures" where code executes but fails to achieve the visual goal. Experiments demonstrate that V-CAGE yields datasets with superior physical and semantic fidelity, significantly boosting the success rate and generalization of downstream policies compared to non-verified baselines.
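The context-aware instantiation mechanism can be illustrated as rejection sampling against a growing forbidden-region map: each placed object's footprint is added to the map, and subsequent candidates are re-sampled until they avoid every claimed region. The sketch below is a simplified 2D illustration under assumed axis-aligned footprints; all names are illustrative and not the paper's API.

```python
import random

def overlaps(a, b):
    """True if two axis-aligned boxes (x0, y0, x1, y1) intersect."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def place_objects(sizes, workspace=(0.0, 0.0, 1.0, 1.0),
                  max_tries=200, seed=0):
    """Place (width, height) footprints in the workspace without overlap.

    A dynamic forbidden-region map (here, the list of already-placed boxes)
    grows as objects land; candidate poses are rejection-sampled until they
    clear every forbidden region, preventing interpenetration.
    """
    rng = random.Random(seed)
    wx0, wy0, wx1, wy1 = workspace
    forbidden = []                       # dynamic forbidden-region map
    for w, h in sizes:
        for _ in range(max_tries):
            x = rng.uniform(wx0, wx1 - w)
            y = rng.uniform(wy0, wy1 - h)
            box = (x, y, x + w, y + h)
            if not any(overlaps(box, f) for f in forbidden):
                forbidden.append(box)    # new object claims its region
                break                    # conflict-free pose accepted
        else:
            raise RuntimeError("no conflict-free placement found")
    return forbidden
```

A full system would use 3D collision geometry and reachability checks rather than 2D boxes, but the accept/reject structure is the same.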