🤖 AI Summary
This work addresses the limitations of existing robotic manipulation evaluation, which relies heavily on simulation and fails to capture real-world factors such as perceptual noise, contact dynamics, and hardware constraints, while physical-world benchmarks lack standardized protocols. To bridge this gap, the authors propose ManipArena, a unified evaluation framework that integrates a high-fidelity, synchronized real-to-sim environment with standardized tasks. It features 20 manipulation tasks that emphasize semantic and spatial reasoning, extending to long-horizon mobile manipulation, along with 10,812 expert demonstration trajectories. ManipArena supports multimodal sensing diagnostics, out-of-distribution generalization tests, and world model validation, enabling, for the first time, cross-platform and reproducible evaluation that balances simulation controllability with real-world complexity. This framework establishes a fair, realistic, and scalable benchmark for Vision-Language-Action models and embodied intelligence systems.
📝 Abstract
Vision-Language-Action (VLA) models and world models have recently emerged as promising paradigms for general-purpose robotic intelligence, yet their progress is hindered by the lack of reliable evaluation protocols that reflect real-world deployment. Existing benchmarks are largely simulator-centric: they provide controllability but fail to capture the reality gap caused by perception noise, complex contact dynamics, hardware constraints, and system latency. Moreover, fragmented real-world evaluations across different robot platforms prevent fair and reproducible comparison. To address these challenges, we introduce ManipArena, a standardized evaluation framework designed to bridge simulation and real-world execution. ManipArena comprises 20 diverse tasks with 10,812 expert trajectories, emphasizing manipulation that requires semantic and spatial reasoning; it supports multi-level generalization testing through controlled out-of-distribution settings and incorporates long-horizon mobile manipulation beyond tabletop scenarios. The framework further provides rich sensory diagnostics, including low-level motor signals, and synchronized real-to-sim environments constructed via high-quality 3D scanning. Together, these features enable fair, realistic, and reproducible evaluation of both VLA and world model approaches, providing a scalable foundation for diagnosing and advancing embodied intelligence systems.