MobileManiBench: Simplifying Model Verification for Mobile Manipulation

📅 2026-02-05
🤖 AI Summary
Existing vision-language-action (VLA) models rely on teleoperated data from static tabletop scenarios, which limits their transferability to mobile manipulation tasks. To address this gap, this work proposes the first simulation-based evaluation framework tailored to mobile manipulation, built on NVIDIA Isaac Sim for a high-fidelity environment. The framework introduces MobileManiBench, a large-scale benchmark comprising two robot platforms, 630 objects, 100 diverse scenes, and over 300,000 multimodal annotated trajectories generated automatically via reinforcement learning. The benchmark enables controlled, scalable evaluation of robot embodiments, perceptual modalities, and policy architectures, supporting research on the data efficiency and generalization of VLA models in complex, dynamic environments.

📝 Abstract
Vision-language-action models have advanced robotic manipulation but remain constrained by their reliance on large, teleoperation-collected datasets dominated by static tabletop scenes. We propose a simulation-first framework to verify VLA architectures before real-world deployment and introduce MobileManiBench, a large-scale benchmark for mobile robotic manipulation. Built on NVIDIA Isaac Sim and powered by reinforcement learning, our pipeline autonomously generates diverse manipulation trajectories with rich annotations (language instructions, multi-view RGB-depth-segmentation images, and synchronized object/robot states and actions). MobileManiBench features 2 mobile platforms (parallel-gripper and dexterous-hand robots), 2 synchronized cameras (head and right wrist), 630 objects in 20 categories, and 5 skills (open, close, pull, push, pick) across over 100 tasks performed in 100 realistic scenes, yielding 300K trajectories. This design enables controlled, scalable studies of robot embodiments, sensing modalities, and policy architectures, accelerating research on data efficiency and generalization. We benchmark representative VLA models and report insights into perception, reasoning, and control in complex simulated environments.
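To make the annotation format concrete, the record below sketches what one benchmark trajectory might look like. This is a hypothetical schema, not the paper's actual data format: all class and field names (`Trajectory`, `TrajectoryStep`, `scene_id`, etc.) are illustrative assumptions based only on the annotation types the abstract lists (language instruction, two camera views, robot/object states, and actions).

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical per-timestep annotation; field names are illustrative,
# not taken from MobileManiBench's released format.
@dataclass
class TrajectoryStep:
    robot_state: List[float]         # e.g. joint positions/velocities
    action: List[float]              # commanded robot action
    object_poses: List[List[float]]  # per-object 7-DoF pose (xyz + quaternion)

# Hypothetical per-episode record mirroring the abstract's annotation list.
@dataclass
class Trajectory:
    instruction: str   # natural-language task instruction
    robot: str         # "parallel_gripper" or "dexterous_hand"
    skill: str         # one of: open, close, pull, push, pick
    scene_id: int      # which of the 100 scenes the episode ran in
    cameras: List[str] = field(default_factory=lambda: ["head", "right_wrist"])
    steps: List[TrajectoryStep] = field(default_factory=list)

# Example episode with a single recorded timestep.
traj = Trajectory(
    instruction="pick up the mug from the counter",
    robot="parallel_gripper",
    skill="pick",
    scene_id=42,
)
traj.steps.append(
    TrajectoryStep(
        robot_state=[0.0] * 7,
        action=[0.0] * 7,
        object_poses=[[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]],
    )
)
print(len(traj.steps))  # → 1
```

A flat record like this is what makes the "controlled, scalable studies" claim plausible: embodiment (`robot`), sensing (`cameras`), and task (`skill`, `scene_id`) are independent fields that can be held fixed or varied across 300K episodes.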
Problem

Research questions and friction points this paper is trying to address.

mobile manipulation
vision-language-action models
model verification
robotic benchmark
teleoperation datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

simulation-first
vision-language-action models
mobile manipulation
autonomous data generation
multi-modal benchmark