World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
In embodied task planning, large vision-language models (LVLMs) face two key challenges: difficulty in modeling dependency constraints and low inference efficiency. To address these, we propose D²PO, a dual-preference learning framework that, for the first time, internalizes world-modeling capability into planning ability via preference learning, jointly optimizing state prediction and action selection. D²PO introduces an annotation-free mechanism, built on a variant of Monte Carlo Tree Search (MCTS), that generates exploration trajectories and automatically constructs step-level preference pairs. Evaluated on VoTa-Bench with Qwen2-VL, LLaVA-1.6, and LLaMA-3.2 backbones, D²PO significantly outperforms existing methods and GPT-4o, achieving higher task success rates and shorter, more efficient execution paths.
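The summary describes jointly optimizing state prediction and action selection through preference learning. The paper's exact objective is not reproduced on this page; the sketch below assumes a standard DPO-style loss applied to two preference signals, one over actions and one over predicted states, combined with a hypothetical trade-off weight `lam`. Both the combination and the weight are illustrative assumptions, not the authors' formulation.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO loss for one (chosen, rejected) pair:
    # -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def dual_preference_loss(action_pair, state_pair, lam=1.0, beta=0.1):
    # Illustrative dual objective: action-selection preference loss plus a
    # weighted state-prediction preference loss. `lam` is a hypothetical knob.
    return dpo_loss(*action_pair, beta=beta) + lam * dpo_loss(*state_pair, beta=beta)
```

When the policy matches the reference on both chosen and rejected responses, the margin is zero and each term sits at its neutral value of log 2; preferring the chosen response pushes the loss below that point.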

📝 Abstract
Recent advances in large vision-language models (LVLMs) have shown promise for embodied task planning, yet they struggle with fundamental challenges like dependency constraints and efficiency. Existing approaches either solely optimize action selection or leverage world models during inference, overlooking the benefits of learning to model the world as a way to enhance planning capabilities. We propose Dual Preference Optimization (D$^2$PO), a new learning framework that jointly optimizes state prediction and action selection through preference learning, enabling LVLMs to understand environment dynamics for better planning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error. Extensive experiments on VoTa-Bench demonstrate that our D$^2$PO-based method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B), achieving superior task success rates with more efficient execution paths.
Problem

Research questions and friction points this paper is trying to address.

Improves embodied task planning using world modeling.
Addresses dependency constraints and efficiency in LVLMs.
Optimizes state prediction and action selection jointly.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual Preference Optimization enhances planning via state-action learning.
Tree search mechanism automates trajectory and preference data collection.
D$^2$PO outperforms GPT-4o on VoTa-Bench with higher success rates.
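The tree search mechanism above collects trajectories and builds step-level preferences without human annotation. A minimal sketch of how step-level pairs could be derived from rollout value estimates follows; the best-versus-each-worse pairing rule and the candidate format are assumptions for illustration, not the paper's actual procedure.

```python
def step_preferences(candidates):
    """Build (chosen, rejected) preference pairs for one planning step.

    `candidates` is a list of dicts, each with an "action" string and a
    "value" score (e.g. a tree-search rollout estimate). The highest-valued
    action is paired against every strictly lower-valued alternative.
    """
    ranked = sorted(candidates, key=lambda c: c["value"], reverse=True)
    best = ranked[0]
    return [(best["action"], c["action"])
            for c in ranked[1:] if c["value"] < best["value"]]
```

Ties with the best candidate produce no pair, so only genuinely distinguishable actions yield preference data.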
Authors
Siyin Wang (Fudan University, Shanghai Innovation Institute)
Zhaoye Fei (Fudan University) · Natural Language Processing
Qinyuan Cheng (Fudan University)
Shiduo Zhang (Fudan University) · Embodied AI, Foundation Models
Panpan Cai (National University of Singapore, Shanghai Jiao Tong University)
Jinlan Fu (National University of Singapore) · Natural Language Processing, Vision and Language, Large Language Model
Xipeng Qiu (Fudan University, Shanghai Innovation Institute)