LongBench: Evaluating Robotic Manipulation Policies on Real-World Long-Horizon Tasks

📅 2026-04-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing robotic manipulation policies degrade significantly on long-horizon tasks, yet few real-world benchmarks can reveal the root causes of these failures. To address this gap, this work proposes LongBench, a mechanism-aware evaluation framework for long-horizon manipulation comprising over 1,000 real-world episodes that span both context-independent (fully observable) and context-dependent (ambiguity-driven) regimes. Tasks are organized into capability- and ambiguity-specific subsets, enabling assessment of execution robustness, temporal consistency, and contextual reasoning. Evaluations of six state-of-the-art policies on LongBench indicate that performance in fully observable tasks is most strongly associated with execution robustness, that contextual difficulty varies across tasks, and that current memory-based methods do not improve it consistently.

📝 Abstract
Robotic manipulation policies often degrade over extended horizons, yet existing benchmarks provide limited insight into why such failures occur. Most prior benchmarks are either simulation-based or report aggregate success, making it difficult to disentangle the distinct sources of temporal difficulty in real-world execution. We introduce LongBench, a real-world benchmark for evaluating long-horizon manipulation. LongBench consists of over 1,000 real-world episodes, covering two complementary regimes: Context-Independent (fully observable) and Context-Dependent (ambiguity-driven). By organizing tasks into capability- and ambiguity-specific subsets, LongBench enables mechanism-aware evaluation of execution robustness, temporal consistency, and context-dependent reasoning. Evaluating six state-of-the-art policies reveals that long-horizon performance is not governed by a single factor. We observe that performance in fully observable settings is more strongly associated with execution robustness, while contextual difficulty varies across tasks and is not consistently improved by memory-based methods. We hope that LongBench serves as a useful benchmark for studying long-horizon manipulation and for developing policies with stronger robustness across both execution and contextual challenges.
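
The contrast the abstract draws between aggregate success and subset-level, mechanism-aware reporting can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the record fields (`regime`, `subset`, `success`), the subset names, and the toy outcomes are hypothetical placeholders, not LongBench's actual schema, task names, or results.

```python
from collections import defaultdict

# Hypothetical episode records. Field names, subset names, and outcomes are
# made-up placeholders, not data or results from LongBench.
episodes = [
    {"regime": "context_independent", "subset": "subset_a", "success": True},
    {"regime": "context_independent", "subset": "subset_a", "success": True},
    {"regime": "context_independent", "subset": "subset_b", "success": False},
    {"regime": "context_dependent", "subset": "subset_c", "success": True},
    {"regime": "context_dependent", "subset": "subset_c", "success": False},
    {"regime": "context_dependent", "subset": "subset_c", "success": False},
]


def success_by_group(records, keys):
    """Return {group: success rate}, where group is a tuple of the given keys."""
    counts = defaultdict(lambda: [0, 0])  # group -> [successes, total]
    for rec in records:
        group = tuple(rec[k] for k in keys)
        counts[group][0] += int(rec["success"])
        counts[group][1] += 1
    return {group: wins / total for group, (wins, total) in counts.items()}


# A single aggregate number hides where the failures come from.
aggregate = sum(rec["success"] for rec in episodes) / len(episodes)

# Grouping by regime, then by regime and subset, exposes the breakdown.
per_regime = success_by_group(episodes, ["regime"])
per_subset = success_by_group(episodes, ["regime", "subset"])

print(f"aggregate success: {aggregate:.2f}")
for group, rate in sorted(per_subset.items()):
    print(group, f"{rate:.2f}")
```

In this toy data the aggregate rate looks moderate, while the grouped view shows that one subset always succeeds and another always fails, which is the kind of distinction a single aggregate score cannot express.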
Problem

Research questions and friction points this paper is trying to address.

long-horizon manipulation
real-world benchmark
temporal difficulty
execution robustness
context-dependent reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon manipulation
real-world benchmark
execution robustness
context-dependent reasoning
temporal consistency