DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited physical commonsense of existing vision-language models (VLMs) when deployed as zero-shot planners, which leads to low success rates and error accumulation in complex manipulation tasks involving deformable objects. To bridge the gap between semantic reasoning and physical execution, we propose DreamPlan, a novel framework that integrates an action-conditional video generation world model with Odds Ratio Policy Optimization (ORPO). Starting from a zero-shot VLM, we first collect interaction data to train a video world model; then, without any additional real-world interaction, we fine-tune the VLM via ORPO using imagined rollouts from the world model to inject physical commonsense. Our approach significantly improves task success—particularly in deformable object manipulation—while drastically reducing reliance on real-world data and environment interactions.

📝 Abstract
Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine-tuning VLMs via real-world interaction is prohibitively expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero-shot VLM to collect exploratory interaction data. We demonstrate that this sub-optimal data is sufficient to train an action-conditioned video generation model, which implicitly captures complex real-world physics. Subsequently, the VLM planner is fine-tuned entirely within the "imagination" of this video world model using Odds Ratio Policy Optimization (ORPO). By utilizing these virtual rollouts, physical and task-specific knowledge is efficiently injected into the VLM. Our results indicate that DreamPlan bridges the gap between semantic reasoning and physical grounding, significantly improving manipulation success rates without the need for large-scale real-world data collection. Our project page is https://psi-lab.ai/DreamPlan/.
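The paper gives no implementation details for its fine-tuning objective, but the odds-ratio term at the heart of ORPO can be sketched from its published form. The sketch below is an assumption-laden illustration, not the authors' code: it takes the mean per-token log-probabilities of a preferred and a dispreferred plan (e.g. ranked by outcomes in the world model's imagined rollouts) and computes the weighted odds-ratio preference loss; the function name, `lam` weight, and the omission of the SFT term are illustrative choices.

```python
import numpy as np

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """Odds-ratio preference term of ORPO, computed from the mean
    per-token log-probabilities of a preferred (chosen) and a
    dispreferred (rejected) plan under the current policy.

    Illustrative sketch only: the full ORPO objective also adds the
    SFT negative log-likelihood on the chosen plan, omitted here.
    """
    # odds(y) = p / (1 - p); work in log space for stability:
    # log odds = log p - log(1 - p) = logp - log1p(-exp(logp))
    log_odds_chosen = logp_chosen - np.log1p(-np.exp(logp_chosen))
    log_odds_rejected = logp_rejected - np.log1p(-np.exp(logp_rejected))
    log_odds_ratio = log_odds_chosen - log_odds_rejected
    # -log sigmoid(log odds ratio), scaled by the preference weight
    return lam * -np.log(1.0 / (1.0 + np.exp(-log_odds_ratio)))
```

The loss shrinks as the policy's odds of the preferred plan grow relative to the dispreferred one, which is how physical knowledge from the virtual rollouts would be injected without real-world interaction.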
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Reinforcement Learning
Robotic Manipulation
Physical Grounding
Sample Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video World Models
Reinforcement Fine-Tuning
Vision-Language Models
ORPO
Virtual Rollouts
Emily Yue-Ting Jia
USC Physical Superintelligence Lab
Weiduo Yuan
Master's Student, USC
Robot Learning, VLA
Tianheng Shi
USC Physical Superintelligence Lab
Vitor Guizilini
Toyota Research Institute
Jiageng Mao
University of Southern California
Robotics, Computer Vision
Yue Wang
USC
Computer Vision, Robotics