🤖 AI Summary
Existing methods for generating full-body motion in 3D scenes struggle to deliver both physical plausibility and fine-grained grasp fidelity: scene-aware models tend to neglect hand-object interaction, while grasp-focused models ignore the surrounding environment. To address this, the paper introduces MOGRAS (Human MOtion with GRAsping in 3D Scenes), a large-scale dataset that pairs pre-grasping full-body walking motions and final grasping poses with richly annotated 3D indoor scenes. Benchmarking existing full-body grasping methods on MOGRAS exposes their limitations in scene-aware generation, and the paper further proposes a simple yet effective adaptation that lets these methods operate seamlessly within 3D scenes. Extensive quantitative and qualitative experiments validate the dataset and show significant improvements from the proposed method over prior approaches.
📝 Abstract
Generating realistic full-body motion that interacts with objects is critical for applications in robotics, virtual reality, and human-computer interaction. While existing methods can generate full-body motion within 3D scenes, they often lack the fidelity needed for fine-grained tasks such as object grasping. Conversely, methods that generate precise grasping motions typically ignore the surrounding 3D scene. Bridging this gap, that is, generating full-body grasping motions that are physically plausible within a 3D scene, remains a significant challenge. To address it, we introduce MOGRAS (Human MOtion with GRAsping in 3D Scenes), a large-scale dataset that fills this need. MOGRAS provides pre-grasping full-body walking motions and final grasping poses within richly annotated 3D indoor scenes. We leverage MOGRAS to benchmark existing full-body grasping methods and demonstrate their limitations in scene-aware generation. Furthermore, we propose a simple yet effective method that adapts existing approaches to work seamlessly within 3D scenes. Through extensive quantitative and qualitative experiments, we validate the effectiveness of our dataset and highlight the significant improvements our proposed method achieves, paving the way for more realistic human-scene interactions.
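To make the dataset's structure concrete, here is a minimal Python sketch of what one MOGRAS-style record could look like, pairing a pre-grasping walking motion with a final grasping pose in an annotated scene. All field names, shapes, and identifiers below (`MograsSample`, `walk_poses`, `grasp_pose`, etc.) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MograsSample:
    """One hypothetical record: a pre-grasping walk plus a final grasp in a scene.

    Field names and array shapes are assumptions for illustration only.
    """
    scene_id: str                  # annotated 3D indoor scene identifier
    object_id: str                 # target object to be grasped
    walk_poses: np.ndarray         # (T, D) per-frame full-body pose parameters
    walk_translations: np.ndarray  # (T, 3) per-frame root translations
    grasp_pose: np.ndarray         # (D,) pose parameters of the final grasp
    grasp_translation: np.ndarray  # (3,) root translation of the final grasp


def is_consistent(sample: MograsSample) -> bool:
    """Check that the walking poses and translations align frame by frame."""
    return sample.walk_poses.shape[0] == sample.walk_translations.shape[0]


# Toy usage with zero-filled arrays (T = 120 frames, D = 63 pose parameters).
sample = MograsSample(
    scene_id="scene_0000",
    object_id="mug_01",
    walk_poses=np.zeros((120, 63)),
    walk_translations=np.zeros((120, 3)),
    grasp_pose=np.zeros(63),
    grasp_translation=np.zeros(3),
)
assert is_consistent(sample)
```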