🤖 AI Summary
To address the modality misalignment between vision and language, as well as inconsistent planning, in complex embodied long-horizon manipulation tasks, this paper proposes a unified vision-language multimodal generative framework. The framework jointly models linguistic reasoning and visual generation via cross-modal attention, and incorporates dynamic perception pretraining—integrating inverse and forward dynamics—with reinforcement-guided fine-tuning to achieve bidirectional alignment between textual logic and visual-spatial representations. The authors design instruction-tuning and reinforcement-alignment losses to jointly optimize the discrete image–language sequence distribution. The approach significantly improves cross-modal consistency and task success rates on long-horizon benchmarks, and—uniquely—enables explicit spatial awareness within collaborative reasoning and planning.
📝 Abstract
In complex embodied long-horizon manipulation tasks, effective task decomposition and execution require synergistic integration of textual logical reasoning and visual-spatial imagination to ensure efficient and accurate operation. Current methods fail to adopt a unified generation framework for multimodal planning, leading to inconsistent multimodal plans. To address this challenge, we present **EVLP (Embodied Vision-Language Planner)**, an innovative multimodal unified generation framework that jointly models linguistic reasoning and visual generation. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment. Our core innovations consist of three key components: **1) Unified Multimodal Generation Framework**: For understanding, we integrate semantic information with spatial features to provide comprehensive visual perception. For generation, we directly learn the joint distribution of discrete image tokens for one-step visual synthesis, enabling coordinated language-visual modeling through learnable cross-modal attention mechanisms. **2) Dynamic Perception Pretraining**: We propose a bidirectional dynamic alignment strategy employing inverse dynamics and forward dynamics tasks, effectively strengthening multimodal correlations within a unified feature space. **3) Reinforced Supervised Fine-Tuning**: While conducting instruction-based fine-tuning in the unified generation space, we construct a reinforcement loss to align the spatial logic between textual actions and generated images, enabling the model to acquire spatially-aware multimodal planning capabilities.
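To make the bidirectional dynamic-alignment idea concrete, the sketch below shows what the two pretraining objectives could look like over discrete token sequences: an inverse-dynamics loss (predict action tokens from consecutive observations) plus a forward-dynamics loss (predict the next observation's discrete image tokens from the current observation and action). The `model` interface, token vocabularies, and equal loss weighting are illustrative assumptions, not EVLP's actual implementation.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy over a batch of discrete targets.

    logits: (N, V) unnormalized scores; targets: (N,) integer token ids.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def dynamics_pretraining_loss(model, obs_t, obs_t1, action_tokens, next_img_tokens):
    """Hypothetical bidirectional dynamic-alignment objective.

    `model` is assumed to expose two heads over the unified feature space:
    predict_action(o_t, o_{t+1}) -> action-token logits (inverse dynamics),
    predict_next_obs(o_t, a_t)   -> image-token logits (forward dynamics).
    """
    # Inverse dynamics: infer which action tokens connect o_t to o_{t+1}.
    inv_loss = cross_entropy(model.predict_action(obs_t, obs_t1), action_tokens)
    # Forward dynamics: imagine the discrete image tokens of o_{t+1}.
    fwd_loss = cross_entropy(model.predict_next_obs(obs_t, action_tokens),
                             next_img_tokens)
    # Equal weighting is an assumption; a real pipeline would tune this.
    return inv_loss + fwd_loss
```

Training on both directions ties the same unified feature space to cause (action given change) and effect (change given action), which is the bidirectional alignment the pretraining stage targets.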