🤖 AI Summary
Data inefficiency hinders language-driven pick-and-place in complex indoor mobile manipulation (MoMa) tasks. Method: We introduce λ, the first long-horizon MoMa benchmark explicitly designed for data-efficiency evaluation. Built upon 571 real human demonstrations spanning cross-room and cross-floor scenarios, λ covers both simulated and real-world settings. We propose a lightweight, high-fidelity benchmark construction paradigm centered on human demonstrations, and a neuro-symbolic modular architecture integrating foundation models, symbolic task planning, and motion planning, which we evaluate against behavioral cloning and reinforcement learning baselines. Contribution/Results: Pure learning methods exhibit poor data efficiency; in contrast, our neuro-symbolic approach achieves over 40% absolute success rate improvement with only a few demonstrations, significantly enhancing robustness. λ is intended to serve as a key benchmark for evaluating data efficiency in MoMa.
📝 Abstract
Efficiently learning and executing long-horizon mobile manipulation (MoMa) tasks is crucial for advancing robotics in household and workplace settings. However, current MoMa models are data-inefficient, underscoring the need for improved models; evaluating their efficiency requires realistically sized benchmarks, which do not yet exist. To address this, we introduce the LAMBDA (λ) benchmark (Long-horizon Actions for Mobile-manipulation Benchmarking of Directed Activities), which evaluates the data efficiency of models on language-conditioned, long-horizon, multi-room, multi-floor pick-and-place tasks using a dataset of manageable size that is more feasible to collect. The benchmark includes 571 human-collected demonstrations that provide realism and diversity in simulated and real-world settings. Unlike planner-generated data, these trajectories offer natural variability and replay-verifiability, ensuring robust learning and evaluation. We benchmark several models, including learning-based models and a neuro-symbolic modular approach combining foundation models with task and motion planning. Learning-based models show suboptimal success rates, even when leveraging pretrained weights, underscoring significant data inefficiencies. However, the neuro-symbolic approach performs significantly better while being more data efficient. These findings highlight the need for more data-efficient learning-based MoMa approaches. λ addresses this gap by serving as a key benchmark for evaluating the data efficiency of those future models in handling household robotics tasks.