🤖 AI Summary
Existing robotic manipulation policies degrade significantly on long-horizon tasks, yet real-world benchmarks that can reveal the root causes of these failures are lacking. To address this gap, this work proposes LongBench, a mechanism-aware evaluation framework for long-horizon manipulation comprising over 1,000 real-world episodes across two regimes: Context-Independent (fully observable) and Context-Dependent (ambiguity-driven). Tasks are organized into capability- and ambiguity-specific subsets, enabling assessment of execution robustness, temporal consistency, and context-dependent reasoning. Evaluations of six state-of-the-art policies on LongBench indicate that performance in fully observable tasks is most strongly associated with execution robustness, that contextual difficulty varies across tasks, and that current memory-based methods do not yield consistent gains.
📝 Abstract
Robotic manipulation policies often degrade over extended horizons, yet existing benchmarks provide limited insight into why such failures occur. Most prior benchmarks are either simulation-based or report only aggregate success, making it difficult to disentangle the distinct sources of temporal difficulty in real-world execution. We introduce LongBench, a real-world benchmark for evaluating long-horizon manipulation. LongBench consists of over 1,000 real-world episodes covering two complementary regimes: Context-Independent (fully observable) and Context-Dependent (ambiguity-driven). By organizing tasks into capability- and ambiguity-specific subsets, LongBench enables mechanism-aware evaluation of execution robustness, temporal consistency, and context-dependent reasoning. Evaluating six state-of-the-art policies reveals that long-horizon performance is not governed by a single factor. We observe that performance in fully observable settings is more strongly associated with execution robustness, whereas contextual difficulty varies across tasks, and memory-based methods do not consistently mitigate it. We hope that LongBench serves as a useful benchmark for studying long-horizon manipulation and for developing policies with stronger robustness across both execution and contextual challenges.
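To illustrate the contrast between aggregate success and a per-subset breakdown, the following is a minimal sketch of how episode outcomes might be grouped by policy, regime, and subset. It is not the LongBench evaluation code: the `Episode` record, its field names, the subset labels, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One evaluated rollout. All field names are hypothetical."""
    policy: str
    regime: str   # e.g. "context_independent" or "context_dependent"
    subset: str   # capability- or ambiguity-specific subset label
    success: bool


def success_by_subset(episodes):
    """Return {(policy, regime, subset): success rate}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for ep in episodes:
        key = (ep.policy, ep.regime, ep.subset)
        counts[key][0] += int(ep.success)
        counts[key][1] += 1
    return {key: s / n for key, (s, n) in counts.items()}


def aggregate_success(episodes, policy):
    """Single aggregate success rate, for contrast with the per-subset view."""
    own = [ep for ep in episodes if ep.policy == policy]
    return sum(ep.success for ep in own) / len(own) if own else float("nan")


# Toy, made-up records purely to exercise the functions (not LongBench data).
toy = [
    Episode("policy_a", "context_independent", "subset_x", True),
    Episode("policy_a", "context_independent", "subset_x", False),
    Episode("policy_a", "context_dependent", "subset_y", False),
    Episode("policy_a", "context_dependent", "subset_y", False),
]
rates = success_by_subset(toy)
overall = aggregate_success(toy, "policy_a")
```

In this toy data the aggregate rate hides the fact that every failure in one regime comes from a single subset, which is the kind of distinction a per-subset breakdown is meant to surface.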