MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional mobile manipulation decouples navigation from manipulation, so reaching the target does not guarantee that manipulation is executable. Method: We propose an affordance-grounded last-mile navigation paradigm and introduce MoMa-Kitchen, the first large-scale kitchen benchmark with over 100K samples, jointly modeling the navigation end-pose and manipulation affordance. We design affordance-grounded floor labels that generalize across robot morphologies, enabling a closed loop of simulated scene generation, first-person RGB-D data collection, and automated affordance annotation. Contribution/Results: We present NavAff, a lightweight model that significantly outperforms baselines on MoMa-Kitchen. The results validate that affordance-driven navigation positioning improves cross-platform manipulation transfer. Together, MoMa-Kitchen and NavAff establish a scalable foundation for training and evaluating tightly coupled navigation and manipulation in embodied AI.
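To make the affordance-grounded floor label concrete, here is a minimal sketch of how a navigation end-pose could be read off such a label. It assumes the label is discretized as a 2D grid of manipulation-feasibility scores over candidate base positions near the target; the grid shape, cell size, and scores are illustrative assumptions, not the dataset's actual format.

```python
# Hypothetical sketch: picking a navigation end-pose from an
# affordance-grounded floor label, assumed here to be a 2D grid of
# manipulation-feasibility scores over candidate base positions.
import numpy as np

def select_end_pose(affordance_grid: np.ndarray,
                    grid_origin: tuple[float, float],
                    cell_size: float) -> tuple[float, float]:
    """Return the (x, y) floor position with the highest affordance score."""
    # argmax over the flattened grid, converted back to 2D indices
    row, col = np.unravel_index(np.argmax(affordance_grid), affordance_grid.shape)
    # map grid indices to metric floor coordinates (cell centers)
    x = grid_origin[0] + (col + 0.5) * cell_size
    y = grid_origin[1] + (row + 0.5) * cell_size
    return x, y

# Toy example: a 5x5 grid of feasibility scores around a countertop target.
grid = np.random.rand(5, 5)
print(select_end_pose(grid, grid_origin=(-0.5, -0.5), cell_size=0.2))
```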

📝 Abstract
In mobile manipulation, navigation and manipulation are often treated as separate problems, resulting in a significant gap between merely approaching an object and engaging with it effectively. Many navigation approaches primarily define success by proximity to the target, often overlooking the necessity for optimal positioning that facilitates subsequent manipulation. To address this, we introduce MoMa-Kitchen, a benchmark dataset comprising over 100k samples that provide training data for models to learn optimal final navigation positions for a seamless transition to manipulation. Our dataset includes affordance-grounded floor labels collected from diverse kitchen environments, in which robotic mobile manipulators of different models attempt to grasp target objects amidst clutter. Using a fully automated pipeline, we simulate diverse real-world scenarios and generate affordance labels for optimal manipulation positions. Visual data are collected from RGB-D inputs captured by a first-person view camera mounted on the robotic arm, ensuring consistency in viewpoint during data collection. We also develop a lightweight baseline model, NavAff, for navigation affordance grounding that demonstrates promising performance on the MoMa-Kitchen benchmark. Our approach enables models to learn affordance-based final positioning that accommodates different arm types and platform heights, thereby paving the way for more robust and generalizable integration of navigation and manipulation in embodied AI. Project page: https://momakitchen.github.io/.
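The abstract's fully automated pipeline suggests a simple labeling loop: sample candidate base positions around the target in simulation, attempt the grasp from each, and record the outcome as the floor affordance label. The sketch below illustrates that loop under stated assumptions; `place_robot` and `attempt_grasp` are hypothetical stand-ins for simulator calls, not the paper's actual API.

```python
# Minimal sketch of an automated affordance-labeling loop in simulation.
# The `sim` object and its methods are hypothetical placeholders.
import numpy as np

def generate_floor_labels(sim, target, candidates: np.ndarray) -> np.ndarray:
    """candidates: (N, 2) metric floor positions; returns (N,) success scores."""
    scores = np.zeros(len(candidates))
    for i, (x, y) in enumerate(candidates):
        sim.place_robot(x, y)                  # teleport the mobile base in sim
        scores[i] = sim.attempt_grasp(target)  # 1.0 on successful grasp, else 0.0
    return scores
```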
Problem

Research questions and friction points this paper is trying to address.

Bridging the gap between navigation and manipulation in mobile robots.
Providing optimal final positioning for a seamless transition to manipulation.
Learning affordance-based final positioning that generalizes across diverse robotic arms.
Innovation

Methods, ideas, or system contributions that make the work stand out.

MoMa-Kitchen: 100k+ benchmark for navigation-manipulation integration
Automated pipeline simulates real-world kitchen scenarios
NavAff model for affordance-based navigation positioning (see the sketch after this list)
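The listing does not detail NavAff's architecture, so the following is only an illustrative sketch of a plausible input/output contract: a first-person RGB-D observation mapped to a dense affordance heatmap over the floor. All layer choices and sizes are assumptions, not the paper's design.

```python
# Illustrative sketch only: a plausible RGB-D -> floor-affordance-heatmap
# contract for a NavAff-style model. Layers and sizes are assumptions.
import torch
import torch.nn as nn

class NavAffSketch(nn.Module):
    def __init__(self, heatmap_size: int = 64):
        super().__init__()
        # 4-channel input: RGB + depth, per the dataset's first-person capture
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.decoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, heatmap_size * heatmap_size),
        )
        self.heatmap_size = heatmap_size

    def forward(self, rgbd: torch.Tensor) -> torch.Tensor:
        """rgbd: (B, 4, H, W) -> per-cell affordance scores (B, S, S)."""
        logits = self.decoder(self.encoder(rgbd))
        return logits.view(-1, self.heatmap_size, self.heatmap_size)

model = NavAffSketch()
scores = model(torch.randn(1, 4, 224, 224))  # toy RGB-D frame
print(scores.shape)  # torch.Size([1, 64, 64])
```

One reason a dense heatmap (rather than a single pose) is a natural fit for the benchmark's goal: it keeps the output robot-agnostic, since each platform can pick the highest-scoring floor cell that is reachable for its own arm and base height.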
👥 Authors
Pingrui Zhang
Fudan University (robotics, embodied AI, computer vision)
Xianqiang Gao
PhD student, University of Science and Technology of China; Shanghai AI Laboratory
Yuhan Wu
Ph.D. student in CS, Peking University (data structures, networking, big data)
Kehui Liu
Shanghai AI Laboratory; Northwestern Polytechnical University
Dong Wang
Shanghai AI Laboratory
Zhigang Wang
Shanghai AI Laboratory
Bin Zhao
Shanghai AI Laboratory; Northwestern Polytechnical University
Yan Ding
Shanghai AI Laboratory
Xuelong Li
TeleAI, China Telecom Corp Ltd