MobileManiBench: Simplifying Model Verification for Mobile Manipulation

📅 2026-02-05
🤖 AI Summary
Existing vision-language-action (VLA) models rely on teleoperated data from static tabletop scenarios, which limits their transferability to mobile manipulation tasks. To address this gap, this work proposes the first simulation-based evaluation framework tailored to mobile manipulation, built on NVIDIA Isaac Sim for a high-fidelity environment. The framework introduces MobileManiBench, a large-scale benchmark comprising two robot platforms, 630 objects, 100 diverse scenes, and over 300,000 multimodal annotated trajectories generated automatically via reinforcement learning. The benchmark enables controlled, scalable evaluation of robot embodiments, perceptual modalities, and policy architectures, supporting research on the data efficiency and generalization of VLA models in complex, dynamic environments.

📝 Abstract
Vision-language-action models have advanced robotic manipulation but remain constrained by their reliance on large, teleoperation-collected datasets dominated by static tabletop scenes. We propose a simulation-first framework to verify VLA architectures before real-world deployment and introduce MobileManiBench, a large-scale benchmark for mobile robotic manipulation. Built on NVIDIA Isaac Sim and powered by reinforcement learning, our pipeline autonomously generates diverse manipulation trajectories with rich annotations (language instructions, multi-view RGB-depth-segmentation images, and synchronized object/robot states and actions). MobileManiBench features 2 mobile platforms (parallel-gripper and dexterous-hand robots), 2 synchronized cameras (head and right wrist), 630 objects in 20 categories, and 5 skills (open, close, pull, push, pick) across over 100 tasks performed in 100 realistic scenes, yielding 300K trajectories. This design enables controlled, scalable studies of robot embodiments, sensing modalities, and policy architectures, accelerating research on data efficiency and generalization. We benchmark representative VLA models and report insights into perception, reasoning, and control in complex simulated environments.
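To make the annotation format concrete, the record below sketches what one benchmark trajectory might look like. This is a hypothetical schema, not the paper's actual data format: all class and field names (`Trajectory`, `TrajectoryStep`, `scene_id`, etc.) are illustrative assumptions based only on the annotation types the abstract lists (language instruction, two camera views, robot/object states, and actions).

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical per-timestep annotation; field names are illustrative,
# not taken from MobileManiBench's released format.
@dataclass
class TrajectoryStep:
    robot_state: List[float]         # e.g. joint positions/velocities
    action: List[float]              # commanded robot action
    object_poses: List[List[float]]  # per-object 7-DoF pose (xyz + quaternion)

# Hypothetical per-episode record mirroring the abstract's annotation list.
@dataclass
class Trajectory:
    instruction: str   # natural-language task instruction
    robot: str         # "parallel_gripper" or "dexterous_hand"
    skill: str         # one of: open, close, pull, push, pick
    scene_id: int      # which of the 100 scenes the episode ran in
    cameras: List[str] = field(default_factory=lambda: ["head", "right_wrist"])
    steps: List[TrajectoryStep] = field(default_factory=list)

# Example episode with a single recorded timestep.
traj = Trajectory(
    instruction="pick up the mug from the counter",
    robot="parallel_gripper",
    skill="pick",
    scene_id=42,
)
traj.steps.append(
    TrajectoryStep(
        robot_state=[0.0] * 7,
        action=[0.0] * 7,
        object_poses=[[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]],
    )
)
print(len(traj.steps))  # → 1
```

A flat record like this is what makes the "controlled, scalable studies" claim plausible: embodiment (`robot`), sensing (`cameras`), and task (`skill`, `scene_id`) are independent fields that can be held fixed or varied across 300K episodes.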
Problem

Research questions and friction points this paper is trying to address.

mobile manipulation
vision-language-action models
model verification
robotic benchmark
teleoperation datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

simulation-first
vision-language-action models
mobile manipulation
autonomous data generation
multi-modal benchmark