World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
In embodied task planning, large vision-language models (LVLMs) face two key challenges: difficulty in modeling dependency constraints and low inference efficiency. To address these, we propose D²PO, a dual-preference learning framework that, for the first time, internalizes world-modeling capability into planning ability via preference learning, jointly optimizing state prediction and action selection. D²PO introduces an annotation-free mechanism, built on a variant of Monte Carlo Tree Search (MCTS), that generates exploration trajectories and automatically constructs step-level preference pairs. Evaluated on VoTa-Bench with Qwen2-VL, LLaVA-1.6, and LLaMA-3.2 backbones, D²PO significantly outperforms existing methods and GPT-4o, achieving higher task success rates and shorter, more efficient execution paths.
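The summary describes jointly optimizing state prediction and action selection through preference learning. The paper's exact objective is not reproduced on this page; the sketch below assumes a standard DPO-style loss applied to two preference signals, one over actions and one over predicted states, combined with a hypothetical trade-off weight `lam`. Both the combination and the weight are illustrative assumptions, not the authors' formulation.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO loss for one (chosen, rejected) pair:
    # -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def dual_preference_loss(action_pair, state_pair, lam=1.0, beta=0.1):
    # Illustrative dual objective: action-selection preference loss plus a
    # weighted state-prediction preference loss. `lam` is a hypothetical knob.
    return dpo_loss(*action_pair, beta=beta) + lam * dpo_loss(*state_pair, beta=beta)
```

When the policy matches the reference on both chosen and rejected responses, the margin is zero and each term sits at its neutral value of log 2; preferring the chosen response pushes the loss below that point.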

📝 Abstract
Recent advances in large vision-language models (LVLMs) have shown promise for embodied task planning, yet they struggle with fundamental challenges like dependency constraints and efficiency. Existing approaches either solely optimize action selection or leverage world models during inference, overlooking the benefits of learning to model the world as a way to enhance planning capabilities. We propose Dual Preference Optimization (D$^2$PO), a new learning framework that jointly optimizes state prediction and action selection through preference learning, enabling LVLMs to understand environment dynamics for better planning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error. Extensive experiments on VoTa-Bench demonstrate that our D$^2$PO-based method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B), achieving superior task success rates with more efficient execution paths.
Problem

Research questions and friction points this paper is trying to address.

Improves embodied task planning using world modeling.
Addresses dependency constraints and efficiency in LVLMs.
Optimizes state prediction and action selection jointly.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual Preference Optimization enhances planning via state-action learning.
Tree search mechanism automates trajectory and preference data collection.
D$^2$PO outperforms GPT-4o on VoTa-Bench with higher success rates.
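The tree search mechanism above collects trajectories and builds step-level preferences without human annotation. A minimal sketch of how step-level pairs could be derived from rollout value estimates follows; the best-versus-each-worse pairing rule and the candidate format are assumptions for illustration, not the paper's actual procedure.

```python
def step_preferences(candidates):
    """Build (chosen, rejected) preference pairs for one planning step.

    `candidates` is a list of dicts, each with an "action" string and a
    "value" score (e.g. a tree-search rollout estimate). The highest-valued
    action is paired against every strictly lower-valued alternative.
    """
    ranked = sorted(candidates, key=lambda c: c["value"], reverse=True)
    best = ranked[0]
    return [(best["action"], c["action"])
            for c in ranked[1:] if c["value"] < best["value"]]
```

Ties with the best candidate produce no pair, so only genuinely distinguishable actions yield preference data.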
Authors
Siyin Wang (Fudan University, Shanghai Innovation Institute)
Zhaoye Fei (Fudan University) · Natural Language Processing
Qinyuan Cheng (Fudan University)
Shiduo Zhang (Fudan University) · Embodied AI, Foundation Models
Panpan Cai (National University of Singapore, Shanghai Jiao Tong University)
Jinlan Fu (National University of Singapore) · Natural Language Processing, Vision and Language, Large Language Model
Xipeng Qiu (Fudan University, Shanghai Innovation Institute)