🤖 AI Summary
This work addresses the challenge of generating physically plausible, direction-controllable 3D whole-body grasps for animation, mixed reality, and robotics. Prior work generates a guiding hand grasp without control over its direction and without awareness of the receptacle supporting the object, so the hand often points in a direction no body can match without penetrating the receptacle; the result then needs heavy post-processing and an expensive, exhaustive body search. The proposed method, CWGrasp, moves geometric reasoning to the start of the pipeline: (1) it builds a probabilistic model of reachable directions via ray-casting from the object and collision checking, and samples a reaching direction from it; (2) it generates a reaching body conditioned on the desired arm direction and a guiding hand whose palm direction complies with the arm direction; and (3) it refines the body to match the guiding hand while plausibly contacting the scene. CWGrasp handles both right- and left-hand grasps. On the GRAB and ReplicaGrasp datasets it outperforms baselines at lower runtime and computational budget, and all components contribute to performance. Code and models will be released.
📝 Abstract
Synthesizing 3D whole-bodies that realistically grasp objects is useful for animation, mixed reality, and robotics. This is challenging, because the hands and body need to look natural w.r.t. each other, the grasped object, as well as the local scene (i.e., a receptacle supporting the object). Only recent work tackles this, with a divide-and-conquer approach; it first generates a "guiding" right-hand grasp, and then searches for bodies that match this. However, the guiding-hand synthesis lacks controllability and receptacle awareness, so it likely has an implausible direction (i.e., a body can't match this without penetrating the receptacle) and needs corrections through major post-processing. Moreover, the body search needs exhaustive sampling and is expensive. These are strong limitations. We tackle these with a novel method called CWGrasp. Our key idea is that performing geometry-based reasoning "early on," instead of "too late," provides rich "control" signals for inference. To this end, CWGrasp first samples a plausible reaching-direction vector (used later for both the arm and hand) from a probabilistic model built via raycasting from the object and collision checking. Then, it generates a reaching body with a desired arm direction, as well as a "guiding" grasping hand with a desired palm direction that complies with the arm's one. Eventually, CWGrasp refines the body to match the "guiding" hand, while plausibly contacting the scene. Notably, generating already-compatible "parts" greatly simplifies the "whole." Moreover, CWGrasp uniquely tackles both right- and left-hand grasps. We evaluate on the GRAB and ReplicaGrasp datasets. CWGrasp outperforms baselines, at lower runtime and budget, while all components help performance. Code and models will be released.
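To make the reaching-direction sampling step concrete, here is a minimal Python sketch. It is not the CWGrasp implementation. It assumes the receptacle can be approximated by axis-aligned boxes, that rays are cast from the object center, and that every collision-free direction is equally likely. The function names (`fibonacci_sphere`, `ray_hits_box`, `direction_probabilities`, `sample_direction`) and the parameters `n_dirs` and `reach` are hypothetical, chosen only for illustration.

```python
import numpy as np

def fibonacci_sphere(n):
    """Return n roughly evenly spread unit vectors on the sphere."""
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n
    r = np.sqrt(1.0 - z * z)
    phi = i * np.pi * (3.0 - np.sqrt(5.0))  # golden-angle increments
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def ray_hits_box(origin, direction, box_min, box_max, max_dist):
    """Slab test: does the ray enter the axis-aligned box within max_dist?"""
    t_near, t_far = 0.0, max_dist
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-12:
            if o < lo or o > hi:
                return False  # parallel to this slab and outside it
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def direction_probabilities(origin, boxes, n_dirs=256, reach=0.8):
    """Cast rays from the object; assumed uniform weight on unblocked ones."""
    dirs = fibonacci_sphere(n_dirs)
    free = np.array([
        not any(ray_hits_box(origin, d, bmin, bmax, reach) for bmin, bmax in boxes)
        for d in dirs
    ], dtype=float)
    if free.sum() == 0:
        raise ValueError("no collision-free direction found")
    return dirs, free / free.sum()

def sample_direction(dirs, probs, rng):
    """Draw one candidate reaching direction from the discrete distribution."""
    return dirs[rng.choice(len(dirs), p=probs)]

# Toy scene (assumed): an object resting 5 cm above a table-like slab.
origin = np.array([0.0, 0.0, 0.05])
boxes = [(np.array([-1.0, -1.0, -0.1]), np.array([1.0, 1.0, 0.0]))]
dirs, probs = direction_probabilities(origin, boxes)
direction = sample_direction(dirs, probs, np.random.default_rng(0))
```

In this toy scene, rays pointing down into the slab are blocked and receive zero probability, so the sampled direction always approaches the object from above or from the side. A sampled vector of this kind is what the abstract describes as conditioning both the arm direction and the palm direction; a real system would use the actual receptacle mesh and might weight directions non-uniformly.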