🤖 AI Summary
This work addresses the cross-modal generation problem of synthesizing 3D hand motions from natural language descriptions—encompassing hand shapes, spatial positions, and finger/arm dynamics. To overcome the scarcity of annotated text-motion pairs, we propose HandMDM, the first text-to-3D-hand-motion diffusion model. Our method leverages large-scale sign language videos and large language models, augmented with a sign-language attribute lexicon and motion script cues, to automatically generate high-quality pseudo-labeled text-motion data. We then train a text-conditioned diffusion model on this data. HandMDM achieves strong cross-domain generalization—uniquely supporting unseen sign classes, heterogeneous sign language systems, and non-sign gestures—while producing high-fidelity, temporally coherent 3D hand motions across diverse scenarios. To foster research in embodied interaction and sign language technology, we will publicly release the code, model, and dataset.
📝 Abstract
Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, and finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels at an unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model, HandMDM, that is robust across domains: not only unseen sign categories from the same sign language, but also signs from another sign language and non-sign hand movements. We contribute an extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.
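The abstract describes training a text-conditioned diffusion model on pseudo-labeled motion data. As a rough illustration only (the paper's actual architecture and parameterization are not specified here), the standard DDPM-style forward noising that such training relies on can be sketched as follows; the dimensions, schedule, and names below are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Illustrative sketch of DDPM-style forward noising, the core of training
# a text-conditioned motion diffusion model. All names/dimensions are
# assumptions for illustration, not the paper's actual choices.

rng = np.random.default_rng(0)

T = 1000                               # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule (common default)
alpha_bars = np.cumprod(1.0 - betas)   # cumulative signal-retention factors

def noise_motion(x0, t, eps):
    """Forward process q(x_t | x_0): blend clean motion with Gaussian noise."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# A hypothetical hand-motion clip: 60 frames x 45 pose parameters.
x0 = rng.standard_normal((60, 45))
eps = rng.standard_normal(x0.shape)
x_t = noise_motion(x0, t=500, eps=eps)

# A denoiser f(x_t, t, text_emb) would then be trained to predict eps,
# conditioned on an embedding of the text description:
#   loss = ||f(x_t, t, text_emb) - eps||^2
print(x_t.shape)
```

At sampling time, the trained denoiser would be applied iteratively from pure noise, with the text embedding steering the generated motion.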