AI Summary
Existing text-driven hand motion generation methods are constrained by studio-captured data and struggle to achieve high motion fidelity and text-motion alignment simultaneously. To address this limitation, this work introduces 3D-HIW, the first large-scale in-the-wild dataset of paired 3D hand motions and textual descriptions, built with an automated annotation pipeline that combines vision-language models and state-of-the-art 3D hand trackers. On top of the dataset, the authors propose CLUTCH, an LLM-based hand animation framework with two key components: SHIFT, a part-modality decomposed VQ-VAE that tokenizes hand motion for improved generalization and reconstruction fidelity, and a geometric refinement stage that co-supervises the LLM with a reconstruction loss on decoded hand motion parameters. The approach achieves state-of-the-art performance on both text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modeling.
Abstract
Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Further, contemporary models and their training schemes struggle to achieve animation fidelity together with text-motion alignment. To address this, we (1) introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences with aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture for tokenizing hand motion, and (b) a geometric refinement stage for finetuning the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) with state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in the wild, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage in which CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modeling. Code, data, and models will be released.
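To make the two components named above more concrete, the minimal PyTorch sketches below illustrate the general ideas under explicit assumptions; they are not the authors' implementation, and every module name, dimension, and decomposition choice is hypothetical. The first sketch shows a part-modality decomposed tokenizer in the spirit of SHIFT, assuming (purely for illustration) a split into a global wrist-trajectory stream and a finger-articulation stream, each with its own codebook:

```python
import torch
import torch.nn as nn

class PartModalityVQ(nn.Module):
    """Toy part-/modality-decomposed motion tokenizer.

    The abstract only states that SHIFT decomposes hand motion before
    quantization; the specific split below (global wrist trajectory vs.
    finger articulation, one codebook each) and all sizes and names are
    illustrative assumptions, not the paper's architecture.
    """

    def __init__(self, wrist_dim=9, finger_dim=45, code_dim=128, n_codes=512):
        super().__init__()
        self.enc_wrist = nn.Linear(wrist_dim, code_dim)
        self.enc_finger = nn.Linear(finger_dim, code_dim)
        self.book_wrist = nn.Embedding(n_codes, code_dim)
        self.book_finger = nn.Embedding(n_codes, code_dim)
        self.dec = nn.Linear(2 * code_dim, wrist_dim + finger_dim)

    @staticmethod
    def quantize(z, book):
        # Nearest-codebook-entry lookup; a full VQ-VAE would add the
        # straight-through estimator plus codebook/commitment losses.
        idx = torch.cdist(z, book.weight).argmin(dim=-1)   # (N,)
        return book(idx), idx

    def forward(self, wrist, fingers):
        # wrist: (N, wrist_dim), fingers: (N, finger_dim); N = batch * frames.
        qw, iw = self.quantize(self.enc_wrist(wrist), self.book_wrist)
        qf, jf = self.quantize(self.enc_finger(fingers), self.book_finger)
        recon = self.dec(torch.cat([qw, qf], dim=-1))
        return recon, (iw, jf)   # two token streams for the LLM stage
```

The second sketch illustrates how the geometric refinement stage could plausibly combine the LLM's next-token loss over motion tokens with a reconstruction loss on motion decoded from the model's predictions; it assumes a soft relaxation over the codebook to keep decoding differentiable, which may differ from the paper's exact mechanism:

```python
import torch
import torch.nn.functional as F

def refinement_step(llm, vq_decoder, codebook, text_ids, motion_ids,
                    gt_motion, lambda_geo=0.5):
    """One hypothetical co-supervised step: token cross-entropy plus a
    geometric reconstruction loss.

    All names are assumptions: `llm` maps token ids to logits over a mixed
    text/motion vocabulary, `vq_decoder` maps code embeddings back to hand
    motion parameters, and `codebook` is the (vocab, dim) embedding matrix.
    """
    n_text = text_ids.shape[1]
    inputs = torch.cat([text_ids, motion_ids[:, :-1]], dim=1)
    logits = llm(inputs)[:, n_text - 1:, :]          # predictions for motion_ids

    # (1) Standard next-token objective on the motion tokens.
    ce = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                         motion_ids.reshape(-1))

    # (2) Geometric co-supervision: decode the soft token distribution and
    # penalize errors in the hand motion parameters themselves, so gradients
    # reflect geometry rather than token identity alone.
    soft_codes = logits.softmax(dim=-1) @ codebook   # (B, T, dim)
    geo = F.l1_loss(vq_decoder(soft_codes), gt_motion)

    return ce + lambda_geo * geo
```

The design intuition, as the abstract describes it, is that cross-entropy alone treats all token mistakes as equally bad, whereas supervising the decoded motion makes geometrically small errors cheap and geometrically large ones expensive.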