🤖 AI Summary
Existing approaches to generating mobile manipulation trajectories struggle to achieve large scale, high diversity, and high kinematic fidelity at the same time, which limits robot deployment in unstructured environments. This work proposes AutoMoMa, a framework that unifies the mobile base, manipulator, and object into a single kinematic chain. By integrating articulated kinematic representations (AKR), GPU-accelerated parallel trajectory optimization, and physics-aware constraint validation, AutoMoMa efficiently generates coordinated whole-body motion trajectories. The method addresses the trade-off among scale, diversity, and fidelity that constrained prior datasets, and supports planning across multiple robot morphologies and complex articulated objects. It produces 5,000 trajectories per GPU-hour, yielding a dataset of over 500,000 trajectories across 330 scenes. Imitation learning policies trained on this data need tens of thousands of demonstrations to reach approximately 80% success on a single articulated-object task.
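To put the quoted throughput in perspective, the following back-of-the-envelope sketch converts the summary's figures into GPU-hours. It takes the dataset size as roughly 500,000 trajectories and the rate as a constant 5,000 per GPU-hour; the helper names and the 8-GPU cluster size are illustrative assumptions, not details from the paper.

```python
# Figures quoted in the summary above.
TRAJECTORIES_PER_GPU_HOUR = 5_000
DATASET_SIZE = 500_000


def gpu_hours_needed(num_trajectories: int, rate_per_gpu_hour: int) -> float:
    """GPU-hours required to generate num_trajectories at a fixed rate."""
    return num_trajectories / rate_per_gpu_hour


def wall_clock_hours(gpu_hours: float, num_gpus: int) -> float:
    """Wall-clock time, assuming ideal linear scaling across GPUs."""
    return gpu_hours / num_gpus


total_gpu_hours = gpu_hours_needed(DATASET_SIZE, TRAJECTORIES_PER_GPU_HOUR)
print(f"GPU-hours for the full dataset: {total_gpu_hours:.0f}")

# num_gpus=8 is a hypothetical cluster size, not a figure from the paper.
print(f"Wall-clock hours on 8 GPUs: {wall_clock_hours(total_gpu_hours, 8):.1f}")
```

Under these assumptions the full dataset corresponds to about 100 GPU-hours of generation.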
📝 Abstract
Robots deployed in unstructured environments must coordinate whole-body motion, moving a mobile base and arm simultaneously, to interact with the physical world. This coupling of mobility and dexterity yields a state space that grows combinatorially with scene and object diversity, demanding datasets far larger than those sufficient for fixed-base manipulation. Yet existing acquisition methods are either labor-intensive (teleoperation) or computationally prohibitive at scale (planning). The core bottleneck is the lack of a scalable pipeline for generating physically valid, coordinated trajectory data across diverse embodiments and environments. Here we introduce AutoMoMa, a GPU-accelerated framework that combines AKR modeling, which consolidates base, arm, and object kinematics into a single chain, with parallelized trajectory optimization. AutoMoMa generates 5,000 episodes per GPU-hour (over $80\times$ faster than CPU-based baselines), producing a dataset of over 500k physically valid trajectories spanning 330 scenes, diverse articulated objects, and multiple robot embodiments. Prior datasets were forced to compromise on scale, diversity, or kinematic fidelity; AutoMoMa addresses all three simultaneously. Training downstream imitation learning (IL) policies further reveals that even a single articulated-object task requires tens of thousands of demonstrations for state-of-the-art methods to reach $\approx 80\%$ success, confirming that data scarcity, rather than algorithmic limitations, has been the binding constraint. AutoMoMa thus bridges high-performance planning and reliable IL-based control, providing the infrastructure previously missing for coordinated mobile manipulation research. By making large-scale, kinematically valid training data practical, AutoMoMa demonstrates generalizable whole-body robot policies capable of operating in the diverse, unstructured settings of the real world.
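The abstract does not spell out how AKR consolidates base, arm, and object kinematics, so the following is only a toy planar sketch of the general idea, not the authors' formulation. It assumes a base with three virtual joints (x, y, yaw), a two-link arm, and a door whose hinge angle is appended as one more joint, so that a single joint vector `q` describes all three. All function names, link lengths, and the closure residual are hypothetical.

```python
import math


def compose(a, b):
    """Compose two planar poses (x, y, theta): apply b in the frame of a."""
    ax, ay, at = a
    bx, by, bt = b
    return (
        ax + math.cos(at) * bx - math.sin(at) * by,
        ay + math.sin(at) * bx + math.cos(at) * by,
        at + bt,
    )


def revolute(angle):
    """Pure rotation about the current frame origin."""
    return (0.0, 0.0, angle)


def link(length):
    """Pure translation along the current frame's x-axis."""
    return (length, 0.0, 0.0)


def chain_end_pose(q, l1=0.5, l2=0.4, door_width=0.8):
    """Forward kinematics of the toy chain: base -> arm -> grasped door.

    q = (base_x, base_y, base_yaw, shoulder, elbow, door_joint).
    The link lengths and door width are made-up values for illustration.
    """
    x, y, yaw, shoulder, elbow, door_joint = q
    pose = (x, y, yaw)
    segments = (
        revolute(shoulder), link(l1),             # upper arm
        revolute(elbow), link(l2),                # forearm, ends at the handle
        revolute(door_joint), link(door_width),   # door panel, ends at the hinge
    )
    for segment in segments:
        pose = compose(pose, segment)
    return pose


def hinge_residual(q, hinge_xy):
    """Distance between the chain's end and the door hinge's fixed world position."""
    ex, ey, _ = chain_end_pose(q)
    return math.hypot(ex - hinge_xy[0], ey - hinge_xy[1])


q = (0.0, 0.0, 0.0, 0.3, -0.6, 0.9)
print(chain_end_pose(q))
print(hinge_residual(q, hinge_xy=(1.5, 0.2)))
```

In a sketch like this, a trajectory optimizer would search over the full vector `q` at every waypoint while driving `hinge_residual` toward zero, so base, arm, and door motion are planned jointly instead of in separate stages.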