RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation

📅 2025-01-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address long-horizon, sparse-reward, and visually complex robotic manipulation tasks, this paper proposes the Recognize-Sense-Plan-Act (RSPA) pipeline and RoboHorizon, an LLM-assisted multi-view world model. Methodologically, it integrates pre-trained large language models (LLMs), vision-based world models, and model-based reinforcement learning (MBRL). Key contributions include: (1) the first LLM-driven multi-stage dense reward generation mechanism to mitigate reward sparsity; (2) keyframe discovery embedded in a multi-view masked autoencoder (MAE) to enable cross-view spatiotemporal representation alignment; and (3) automatic task decomposition and cross-stage perception–planning coupling guided by natural language instructions. Evaluated on RLBench and FurnitureBench, RoboHorizon significantly outperforms state-of-the-art visual model-based RL methods, achieving a 23.35% absolute improvement in success rate on short-horizon tasks and a 29.23% gain on long-horizon tasks, including furniture assembly.
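The multi-stage dense reward idea can be illustrated with a toy sketch (not the paper's implementation; the `llm_decompose` stand-in and the stage names are hypothetical): an LLM splits the task instruction into ordered sub-stages, and the agent receives dense progress reward as each stage completes, instead of a single sparse terminal reward.

```python
def llm_decompose(instruction):
    """Stand-in for an LLM call that splits a task instruction into
    ordered sub-stages (here: a fixed example output)."""
    return ["reach_handle", "grasp_handle", "open_drawer", "place_object"]

class StagedReward:
    """Dense reward shaper over LLM-generated sub-stages."""

    def __init__(self, stages, stage_bonus=1.0):
        self.stages = stages
        self.stage_bonus = stage_bonus
        self.next_stage = 0  # index of the first uncompleted stage

    def __call__(self, stage_done):
        """stage_done: predicate mapping a stage name -> bool.
        Returns reward for all stages newly completed, in order."""
        reward = 0.0
        while (self.next_stage < len(self.stages)
               and stage_done(self.stages[self.next_stage])):
            reward += self.stage_bonus
            self.next_stage += 1
        return reward

stages = llm_decompose("open the drawer and put the block inside")
shaper = StagedReward(stages)
print(shaper(lambda s: s == "reach_handle"))  # -> 1.0
```

In the actual framework the stage-completion signal would come from the robot's perception pipeline; here it is a plain predicate so the shaping logic stands alone.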

📝 Abstract
Efficient control in long-horizon robotic manipulation is challenging due to complex representation and policy learning requirements. Model-based visual reinforcement learning (RL) has shown great potential in addressing these challenges but still faces notable limitations, particularly in handling sparse rewards and complex visual features in long-horizon environments. To address these limitations, we propose the Recognize-Sense-Plan-Act (RSPA) pipeline for long-horizon tasks and further introduce RoboHorizon, an LLM-assisted multi-view world model tailored for long-horizon robotic manipulation. In RoboHorizon, pre-trained LLMs generate dense reward structures for multi-stage sub-tasks based on task language instructions, enabling robots to better recognize long-horizon tasks. Keyframe discovery is then integrated into the multi-view masked autoencoder (MAE) architecture to enhance the robot's ability to sense critical task sequences, strengthening its multi-stage perception of long-horizon processes. Leveraging these dense rewards and multi-view representations, a robotic world model is constructed to efficiently plan long-horizon tasks, enabling the robot to reliably act through RL algorithms. Experiments on two representative benchmarks, RLBench and FurnitureBench, show that RoboHorizon outperforms state-of-the-art visual model-based RL methods, achieving a 23.35% improvement in task success rates on RLBench's 4 short-horizon tasks and a 29.23% improvement on 6 long-horizon tasks from RLBench and 3 furniture assembly tasks from FurnitureBench.
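As a loose illustration of the keyframe-discovery idea (a simple heuristic of my own for illustration, not the paper's method): frames whose embedding changes most from the previous frame are flagged as candidate keyframes, which could then receive preferential treatment, such as lighter masking, in the multi-view MAE.

```python
import numpy as np

def discover_keyframes(features, k=3):
    """features: (T, D) array of per-frame embeddings.
    Returns indices of the k frames with the largest feature
    change relative to the previous frame, in temporal order."""
    diffs = np.linalg.norm(np.diff(features, axis=0), axis=1)  # (T-1,)
    idx = np.argsort(diffs)[-k:] + 1  # +1: diff i is between frames i and i+1
    return sorted(idx.tolist())

# Toy trajectory: abrupt embedding jumps at frames 2 and 4.
feats = np.array([[0, 0], [0, 0], [5, 0], [5, 0], [0, 9], [0, 9]], dtype=float)
print(discover_keyframes(feats, k=2))  # -> [2, 4]
```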
Problem

Research questions and friction points this paper is trying to address.

Complex Visual Reinforcement Learning
Sparse Reward
Long-Horizon Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

RSPA Pipeline
Large Language Model
Enhanced Task Success Rate
Zixuan Chen
State Key Laboratory for Novel Software Technology, Nanjing University, China
Jing Huo
Nanjing University
Machine Learning · Computer Vision
Yangtao Chen
Master Student, Nanjing University, China
Embodied AI · Robotics
Yang Gao
State Key Laboratory for Novel Software Technology, Nanjing University, China