Reinforced Embodied Planning with Verifiable Reward for Real-World Robotic Manipulation

📅 2025-09-30

🤖 AI Summary
Vision-language models (VLMs) face two key bottlenecks in long-horizon embodied manipulation: the scarcity of large-scale sequential data with multi-step language–action alignment, and the absence of interpretable, verifiable fine-grained rewards. Method: We propose REVER, a framework integrating (1) a verifiable ordered bipartite matching reward that guides VLMs toward physically plausible, spatiotemporally coherent, and interpretable reasoning traces; (2) the hardware-agnostic Universal Manipulation Interface for automated collection and annotation of vision–instruction–plan triplets; and (3) end-to-end fine-tuning via reinforcement learning with dense rewards. Contribution/Results: The lightweight RoboFarseer model improves over state-of-the-art baselines by more than 40% on open-ended task planning and raises overall success by roughly 60% on long-horizon real-world manipulation tasks, significantly advancing the reliable deployment of VLMs in embodied intelligence.

📝 Abstract
Enabling robots to execute long-horizon manipulation tasks from free-form language instructions remains a fundamental challenge in embodied AI. While vision-language models (VLMs) have shown promise as high-level planners, their deployment in the real world is hindered by two gaps: (i) the scarcity of large-scale, sequential manipulation data that couples natural language with multi-step action plans, and (ii) the absence of dense, interpretable rewards for fine-tuning VLMs on planning objectives. To address these issues, we propose REVER, a framework that empowers VLMs to generate and validate long-horizon manipulation plans from natural language instructions in real-world scenarios. Under REVER we train and release RoboFarseer, a VLM incentivized to emit chain-of-thought traces that perform temporal and spatial reasoning, ensuring physically plausible and logically coherent plans. To obtain training data, we leverage the Universal Manipulation Interface framework to capture hardware-agnostic demonstrations of atomic skills. An automated annotation engine converts each demonstration into a vision–instruction–plan triplet. We introduce a verifiable reward that scores the generated plan by its ordered bipartite matching overlap with the ground-truth skill sequence. At run time, the fine-tuned VLM functions both as a planner and as a monitor, verifying step-wise completion. RoboFarseer matches or exceeds the performance of proprietary models that are orders of magnitude larger, while on open-ended planning it surpasses the best baseline by more than 40%. In real-world, long-horizon tasks, the complete system boosts overall success by roughly 60% compared with the same low-level controller without the planner. We will open-source both the dataset and the trained model upon publication.
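The abstract does not give the exact formula for the ordered-matching reward, but one plausible reading of "ordered bipartite matching overlap" is the size of the largest order-preserving match between the predicted and ground-truth skill sequences (a longest common subsequence), normalized by sequence length. A minimal sketch under that assumption:

```python
def ordered_matching_reward(predicted, ground_truth):
    """Score a predicted skill sequence against the ground truth.

    Uses the longest common subsequence (largest order-preserving
    match) normalized by the longer sequence, so both missing and
    spurious steps are penalized. This is an illustrative guess at
    the paper's reward, not its published formulation.
    """
    m, n = len(predicted), len(ground_truth)
    if m == 0 or n == 0:
        return 0.0
    # Classic LCS dynamic program: dp[i][j] holds the best match
    # count between predicted[:i] and ground_truth[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if predicted[i - 1] == ground_truth[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n)
```

For example, a plan with one extra step, `["pick", "move", "place"]` against `["pick", "place"]`, matches two of three steps in order and scores 2/3, while a correctly ordered exact plan scores 1.0.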
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity for language-guided robotic manipulation tasks
Providing verifiable rewards for vision-language model fine-tuning
Enabling long-horizon planning from natural language instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training VLM with verifiable reward for planning
Automated annotation of vision-instruction-plan triplets
VLM functions as both planner and completion monitor
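The dual planner/monitor role described above can be sketched as a simple control loop: the VLM first emits a skill plan, then checks each step's completion after the low-level controller executes it. All function names here (`vlm_plan`, `vlm_verify_step`, `execute_skill`) are hypothetical placeholders for illustration, not the paper's actual API.

```python
def run_task(instruction, observe, vlm_plan, vlm_verify_step,
             execute_skill, max_retries=2):
    """Plan with the VLM, execute each step, and use the same VLM
    to verify step-wise completion, retrying failed steps."""
    plan = vlm_plan(instruction, observe())      # VLM as planner
    for step in plan:
        for _attempt in range(max_retries + 1):
            execute_skill(step)                  # low-level controller
            if vlm_verify_step(step, observe()):  # VLM as monitor
                break                            # step verified done
        else:
            return False  # step never verified: abort the task
    return True
```

The monitor role is what makes the dense reward usable at deployment time: the same model that scored plans during training can flag an unfinished step and trigger a retry instead of letting errors compound across the horizon.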
Authors

Zitong Bo — Xiaomi Robotics Lab
Yue Hu — Xiaomi Robotics Lab; College of Computer Science and Technology, Zhejiang University
Jinming Ma — University of Science and Technology of China
Mingliang Zhou — Xiaomi Robotics Lab
Junhui Yin — Xiaomi Robotics Lab
Yachen Kang — Xiaomi Robotics Lab
Yuqi Liu — Xiaomi Robotics Lab
Tong Wu — Xiaomi Robotics Lab
Diyun Xiang — Xiaomi Robotics Lab
Hao Chen