🤖 AI Summary
Existing methods for hand-object interaction (HOI) video generation suffer from fragmented modeling of pose, appearance, and dynamics, hindering high-fidelity, controllable synthesis. This work proposes the first unified Pose-Appearance-Motion generative framework, which integrates sparse conditioning signals—depth maps, segmentation masks, and keypoints—to enable end-to-end synthesis of high-resolution, temporally coherent HOI videos. Evaluated on the DexYCB and OAKINK2 datasets, the method substantially outperforms current baselines on both FVD and MPJPE. Moreover, the generated synthetic data effectively improves downstream hand pose estimation, demonstrating strong potential for simulation-to-reality (sim-to-real) transfer.
📝 Abstract
Hand-object interaction (HOI) reconstruction and synthesis are becoming central to embodied AI and AR/VR. Yet, despite rapid progress, existing HOI generation research remains fragmented across three disjoint tracks: (1) pose-only synthesis that predicts MANO trajectories without producing pixels; (2) single-image HOI generation that hallucinates appearance from masks or 2D cues but lacks dynamics; and (3) video generation methods that require both the entire pose sequence and the ground-truth first frame as inputs, preventing true sim-to-real deployment. Inspired by the philosophy of Joo et al. (2018), we argue that HOI generation requires a unified engine that brings together pose, appearance, and motion within one coherent framework. We therefore introduce PAM: a Pose-Appearance-Motion Engine for controllable HOI video generation. We validate our engine as follows: (1) On DexYCB, we obtain an FVD of 29.13 (vs. 38.83 for InterDyn) and an MPJPE of 19.37 mm (vs. 30.05 mm for CosHand), while generating higher-resolution 480x720 videos compared to the 256x256 and 256x384 baselines. (2) On OAKINK2, our full multi-condition model improves FVD from 68.76 to 46.31. (3) An ablation over input conditions on DexYCB shows that combining depth, segmentation, and keypoints consistently yields the best results. (4) For a downstream hand pose estimation task using SimpleHand, augmenting training with 3,400 synthetic videos (207k frames) allows a model trained on only 50% of the real data plus our synthetic data to match the 100% real baseline.
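For readers unfamiliar with the pose metric reported above, MPJPE (Mean Per-Joint Position Error) is the average Euclidean distance between predicted and ground-truth 3D joints. The NumPy sketch below is illustrative only; the joint count (21, a common MANO-style hand convention) and the toy offset are assumptions, not values from the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth 3D joints, in the units of the inputs
    (millimeters here). Arrays have shape (num_frames, num_joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example (hypothetical data): 2 frames, 21 hand joints, mm units.
gt = np.zeros((2, 21, 3))
pred = gt + np.array([3.0, 0.0, 4.0])  # every joint shifted by a 3-4-5 triangle
print(mpjpe(pred, gt))  # 5.0 mm for this synthetic case
```

Lower is better; the paper's 19.37 mm on DexYCB is this quantity averaged over all evaluated frames and joints.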