🤖 AI Summary
To address the modality misalignment between vision and language, as well as inconsistent planning, in complex embodied long-horizon manipulation tasks, this paper proposes a unified vision-language multimodal generative framework. The framework jointly models linguistic reasoning and visual generation via cross-modal attention, and incorporates dynamic perception pretraining—integrating inverse and forward dynamics—with reinforcement-guided fine-tuning to achieve bidirectional alignment between textual logic and visual-spatial representations. The authors design instruction-tuning and reinforcement-alignment losses to jointly optimize the discrete image–language sequence distribution. The approach significantly improves cross-modal consistency and task success rates on long-horizon benchmarks, and—uniquely—enables explicit spatial awareness within collaborative reasoning and planning.
📝 Abstract
In complex embodied long-horizon manipulation tasks, effective task decomposition and execution require synergistic integration of textual logical reasoning and visual-spatial imagination to ensure efficient and accurate operation. Current methods fail to adopt a unified generation framework for multimodal planning, leading to inconsistent multimodal plans. To address this challenge, we present **EVLP (Embodied Vision-Language Planner)**, an innovative multimodal unified generation framework that jointly models linguistic reasoning and visual generation. Our approach achieves multimodal planning for long-horizon tasks through a novel training pipeline incorporating dynamic pretraining and reinforced alignment. Our core innovations consist of three key components: **1) Unified Multimodal Generation Framework**: For understanding, we integrate semantic information with spatial features to provide comprehensive visual perception. For generation, we directly learn the joint distribution of discrete image tokens for one-step visual synthesis, enabling coordinated language-visual modeling through learnable cross-modal attention mechanisms. **2) Dynamic Perception Pretraining**: We propose a bidirectional dynamic alignment strategy employing inverse dynamics and forward dynamics tasks, effectively strengthening multimodal correlations within a unified feature space. **3) Reinforced Supervised Fine-Tuning**: While conducting instruction-based fine-tuning in the unified generation space, we construct a reinforcement loss to align the spatial logic between textual actions and generated images, enabling the model to acquire spatially-aware multimodal planning capabilities.
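To make the bidirectional dynamic-alignment idea concrete, the sketch below shows what the two pretraining objectives could look like over discrete token sequences: an inverse-dynamics loss (predict action tokens from consecutive observations) plus a forward-dynamics loss (predict the next observation's discrete image tokens from the current observation and action). The `model` interface, token vocabularies, and equal loss weighting are illustrative assumptions, not EVLP's actual implementation.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy over a batch of discrete targets.

    logits: (N, V) unnormalized scores; targets: (N,) integer token ids.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def dynamics_pretraining_loss(model, obs_t, obs_t1, action_tokens, next_img_tokens):
    """Hypothetical bidirectional dynamic-alignment objective.

    `model` is assumed to expose two heads over the unified feature space:
    predict_action(o_t, o_{t+1}) -> action-token logits (inverse dynamics),
    predict_next_obs(o_t, a_t)   -> image-token logits (forward dynamics).
    """
    # Inverse dynamics: infer which action tokens connect o_t to o_{t+1}.
    inv_loss = cross_entropy(model.predict_action(obs_t, obs_t1), action_tokens)
    # Forward dynamics: imagine the discrete image tokens of o_{t+1}.
    fwd_loss = cross_entropy(model.predict_next_obs(obs_t, action_tokens),
                             next_img_tokens)
    # Equal weighting is an assumption; a real pipeline would tune this.
    return inv_loss + fwd_loss
```

Training on both directions ties the same unified feature space to cause (action given change) and effect (change given action), which is the bidirectional alignment the pretraining stage targets.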