MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models

📅 2026-01-28
📈 Citations: 0
Influential: 0
📝 Abstract
Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of RL. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL (Multi-stAge guidance for Robotic manipulation via Vision-Language models). MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks, using task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.
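The abstract does not say how the task direction projection or the multi-stage decomposition is computed. As a rough illustration only, the sketch below assumes each stage has a start embedding and a goal embedding (for example, produced by a VLM) and scores the current observation by how far it has moved along the start-to-goal direction. The function names `direction_projection_reward` and `multi_stage_reward`, and the way stages are summed, are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def direction_projection_reward(
    start: Sequence[float],
    goal: Sequence[float],
    current: Sequence[float],
) -> float:
    """Fraction of the start-to-goal displacement covered by `current`.

    Assumption: all three vectors live in the same embedding space.
    Returns 0.0 at `start` and 1.0 at `goal`. Motion orthogonal to the
    task direction contributes nothing; motion away from the goal is negative.
    """
    direction = [g - s for g, s in zip(goal, start)]
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        # Degenerate stage: start and goal coincide, so no direction exists.
        return 0.0
    offset = [c - s for c, s in zip(current, start)]
    return sum(o * d for o, d in zip(offset, direction)) / norm_sq


def multi_stage_reward(stage_index: int, stage_progress: float) -> float:
    """Hypothetical aggregation: each completed stage counts 1.0,
    and the active stage adds its progress clipped to [0, 1]."""
    return stage_index + min(max(stage_progress, 0.0), 1.0)


if __name__ == "__main__":
    # Halfway along the task direction, with an irrelevant sideways offset.
    progress = direction_projection_reward([0.0, 0.0], [2.0, 0.0], [1.0, 5.0])
    print(multi_stage_reward(stage_index=1, stage_progress=progress))
```

Under these assumptions, sideways drift in embedding space does not change the reward, while movement against the task direction lowers it, which is one plausible reading of "trajectory sensitivity".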
Problem

Research questions and friction points this paper is trying to address.

dense reward design
robotic reinforcement learning
Vision-Language Models
spatial grounding
task semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Multi-Stage Guidance
Robotic Manipulation
Reward Design
Spatial Grounding
Xunlan Zhou
School of Intelligent Science and Technology, Nanjing University, China; National Key Laboratory for Novel Software Technology, School of Artificial Intelligence, Nanjing University, China
Xuanlin Chen
School of Intelligent Science and Technology, Nanjing University, China; National Key Laboratory for Novel Software Technology, School of Artificial Intelligence, Nanjing University, China
Shaowei Zhang
National Key Laboratory for Novel Software Technology, School of Artificial Intelligence, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Xiangkun Li
School of Computer Science and Technology, Beijing Institute of Technology, China
ShengHua Wan
National Key Laboratory for Novel Software Technology, School of Artificial Intelligence, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Xiaohai Hu
MACS Lab, University of Washington
Lei Yuan
Nanjing University
Machine Learning, Reinforcement Learning, Multi-Agent Systems, Embodied AI
Le Gan
Nanjing University of Science and Technology
Artificial Intelligence, Machine Learning
De-chuan Zhan
National Key Laboratory for Novel Software Technology, School of Artificial Intelligence, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China