🤖 AI Summary
To address long-horizon, sparse-reward, and visually complex robotic manipulation tasks, this paper proposes the Recognize-Sense-Plan-Act (RSPA) pipeline and builds RoboHorizon, an LLM-assisted multi-view world model, on top of it. Methodologically, it integrates pre-trained large language models (LLMs), vision-based world models, and model-based reinforcement learning (MBRL). Key contributions include: (1) the first LLM-driven multi-stage dense reward generation mechanism, mitigating reward sparsity; (2) keyframe discovery embedded in a multi-view masked autoencoder (MAE) to align spatiotemporal representations across views; and (3) automatic task decomposition and cross-stage perception–planning coupling guided by natural language instructions. Evaluated on RLBench and FurnitureBench, RoboHorizon significantly outperforms state-of-the-art visual model-based RL methods, achieving a 23.35% absolute improvement in success rate on RLBench's short-horizon tasks and a 29.23% gain across RLBench's long-horizon tasks and FurnitureBench's furniture assembly tasks.
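To make the multi-stage dense reward idea concrete, here is a minimal sketch in Python. It assumes the LLM has already decomposed the language instruction into ordered sub-tasks; the `Stage` container, its predicates, and the shaping scheme are hypothetical names for illustration, not the authors' actual implementation.

```python
# Illustrative sketch (not the paper's code) of stage-wise dense rewards
# derived from an LLM's decomposition of a task instruction.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Stage:
    """One sub-task from the LLM's decomposition of the instruction,
    e.g. "open the drawer" -> reach handle, grasp handle, pull drawer."""
    name: str                                  # human-readable sub-task label
    success: Callable[[np.ndarray], bool]      # stage-completion predicate
    progress: Callable[[np.ndarray], float]    # dense progress signal in [0, 1]


def multistage_reward(state: np.ndarray, stages: List[Stage]) -> float:
    """Dense reward: +1 per completed stage, plus shaped progress on the
    first unfinished stage, so the agent always receives a gradient."""
    reward = 0.0
    for stage in stages:
        if stage.success(state):
            reward += 1.0                      # completed-stage bonus
        else:
            reward += stage.progress(state)    # dense shaping on active stage
            break                              # later stages are not yet active
    return reward
```

The point of this design is that the agent is always shaped toward the first unfinished sub-task instead of waiting for a single sparse terminal signal, which is what the summary means by mitigating reward sparsity over long horizons.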
📝 Abstract
Efficient control in long-horizon robotic manipulation is challenging due to complex representation and policy learning requirements. Model-based visual reinforcement learning (RL) has shown great potential in addressing these challenges but still faces notable limitations, particularly in handling sparse rewards and complex visual features in long-horizon environments. To address these limitations, we propose the Recognize-Sense-Plan-Act (RSPA) pipeline for long-horizon tasks and further introduce RoboHorizon, an LLM-assisted multi-view world model tailored for long-horizon robotic manipulation. In RoboHorizon, pre-trained LLMs generate dense reward structures for multi-stage sub-tasks based on task language instructions, enabling robots to better recognize long-horizon tasks. Keyframe discovery is then integrated into the multi-view masked autoencoder (MAE) architecture to enhance the robot's ability to sense critical task sequences, strengthening its multi-stage perception of long-horizon processes. Leveraging these dense rewards and multi-view representations, a robotic world model is constructed to efficiently plan long-horizon tasks, enabling the robot to reliably act through RL algorithms. Experiments on two representative benchmarks, RLBench and FurnitureBench, show that RoboHorizon outperforms state-of-the-art visual model-based RL methods, achieving a 23.35% improvement in task success rates on RLBench's 4 short-horizon tasks and a 29.23% improvement on 6 long-horizon tasks from RLBench and 3 furniture assembly tasks from FurnitureBench.
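As a complement to the abstract's description of keyframe discovery, the sketch below illustrates one plausible reading: frames scored as keyframes are masked less aggressively, so the multi-view MAE must reconstruct, and therefore attend to, critical task steps. The change-based scoring heuristic and the names `keyframe_scores` and `mask_ratios` are assumptions for illustration; the paper's actual discovery criterion may differ.

```python
# Minimal sketch of keyframe-aware masking for a multi-view MAE
# (assumption: keyframes are frames with large inter-frame change).
import numpy as np


def keyframe_scores(frames: np.ndarray) -> np.ndarray:
    """Score each timestep by mean pixel change across all camera views.
    frames: (T, V, H, W, C) with T timesteps and V views; returns (T,) in [0, 1]."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3, 4))
    scores = np.concatenate([[0.0], diffs])    # first frame has no predecessor
    return scores / (scores.max() + 1e-8)      # normalize to [0, 1]


def mask_ratios(frames: np.ndarray, base: float = 0.9, floor: float = 0.5) -> np.ndarray:
    """Per-frame MAE mask ratio: likely keyframes keep more visible patches
    (lower mask ratio), interpolating between `base` and `floor`."""
    s = keyframe_scores(frames)
    return base - (base - floor) * s           # high score -> lower mask ratio
```

Under this reading, ordinary frames are masked at the usual high MAE ratio while keyframes retain more visible patches, biasing the learned cross-view representation toward the task-critical transitions the abstract refers to.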