HanDrawer: Leveraging Spatial Information to Render Realistic Hands Using a Conditional Diffusion Model in Single Stage

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-image diffusion models suffer from persistent artifacts in hand generation—particularly incorrect finger counts and geometric distortions—hindering fine-grained gesture modeling. To address this, we propose HanDrawer, the first method embedding MANO hand mesh spatial structure and biomechanical constraints into a graph convolutional network (GCN), coupled with cross-modal cross-attention for precise alignment between hand spatial features and textual conditions. We further introduce position-preserving zero-padding (PPZP) to maintain feature-space consistency and incorporate a hand reconstruction loss to enforce region-specific focus. Evaluated on a cleaned and enhanced HaGRID dataset, our approach achieves state-of-the-art performance, significantly mitigating finger omission and distortion. Both qualitative and quantitative results demonstrate consistent superiority over prior methods.
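The summary mentions extracting hand spatial structure with a graph convolutional network over MANO mesh vertices. The paper's exact layer design is not given here, but the idea can be illustrated with a standard symmetric-normalized GCN layer (a generic sketch under that assumption, not the authors' architecture):

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution layer over mesh vertices (generic sketch).
    X: (N, F) per-vertex features, A: (N, N) mesh adjacency, W: (F, F_out).
    Follows the common formulation: ReLU(D^-1/2 (A+I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)  # ReLU
```

For MANO, `A` would encode the edges of the hand mesh, so each layer propagates features between physically adjacent vertices, which is how geometric and biomechanical structure enters the features.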

📝 Abstract
Although diffusion methods excel in text-to-image generation, generating accurate hand gestures remains a major challenge, resulting in severe artifacts such as an incorrect number of fingers or unnatural gestures. To enable the diffusion model to learn the spatial information needed to improve the quality of generated hands, we propose HanDrawer, a module that conditions the hand generation process. Specifically, we apply graph convolutional layers to extract the endogenous spatial structure and physical constraints implicit in MANO hand mesh vertices. We then align and fuse these spatial features with other modalities via cross-attention. The spatially fused features guide the denoising process of a single-stage diffusion model for high-quality generation of the hand region. To improve the accuracy of spatial feature fusion, we propose a Position-Preserving Zero Padding (PPZP) fusion strategy, which ensures that the features extracted by HanDrawer are fused into the region of interest in the relevant layers of the diffusion model. HanDrawer learns whole-image features while paying special attention to the hand region, thanks to an additional hand reconstruction loss combined with the denoising loss. To train and evaluate our approach accurately, we carefully cleanse and relabel the widely used HaGRID hand gesture dataset, obtaining high-quality multimodal data. Quantitative and qualitative analyses demonstrate the state-of-the-art performance of our method on the HaGRID dataset across multiple evaluation metrics. Source code and our enhanced dataset will be released publicly if the paper is accepted.
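The PPZP strategy described in the abstract can be pictured as placing the extracted hand features into an all-zero feature map at the hand's spatial location before fusion, so the features land in the correct region of the diffusion model's feature space. A minimal sketch, assuming a channels-first layout and a hand bounding box in feature-map coordinates (both assumptions, not details from the paper):

```python
import numpy as np

def ppzp_fuse(hand_feat, bbox, feat_hw):
    """Position-Preserving Zero Padding (hypothetical sketch).
    hand_feat: (C, h, w) features for the hand region only.
    bbox: (y0, x0, y1, x1) hand location in feature-map coordinates.
    feat_hw: (H, W) spatial size of the target feature map.
    Returns a (C, H, W) map that is zero everywhere except the hand region,
    so adding it to a diffusion feature map only affects that region."""
    H, W = feat_hw
    C = hand_feat.shape[0]
    y0, x0, y1, x1 = bbox
    padded = np.zeros((C, H, W), dtype=hand_feat.dtype)
    padded[:, y0:y1, x0:x1] = hand_feat  # hand_feat must be (C, y1-y0, x1-x0)
    return padded
```

The zero padding outside the box means the fused contribution vanishes away from the hand, which is what keeps the guidance spatially aligned with the region of interest.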
Problem

Research questions and friction points this paper is trying to address.

Generating realistic hand gestures using diffusion models.
Improving spatial feature fusion for accurate hand generation.
Enhancing dataset quality for training and evaluation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph convolutional layers extract hand spatial structure.
Cross-attention aligns spatial features with other modalities.
Position-Preserving Zero Padding ensures accurate feature fusion.
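The training objective combines a standard denoising loss with a hand reconstruction loss that focuses on the hand region. A minimal sketch of such a combined objective, where the mask-restricted reconstruction term and the weighting factor `lam` are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def combined_loss(pred_noise, true_noise, pred_img, target_img, hand_mask, lam=1.0):
    """Hypothetical combined objective: denoising MSE plus a
    hand-region reconstruction term weighted by `lam`.
    hand_mask is a binary map selecting the hand region."""
    denoise = np.mean((pred_noise - true_noise) ** 2)
    m = hand_mask.astype(pred_img.dtype)
    # reconstruction error averaged over hand pixels only
    recon = np.sum(m * (pred_img - target_img) ** 2) / np.maximum(m.sum(), 1.0)
    return denoise + lam * recon
```

Restricting the reconstruction term to the masked region is one simple way to realize "special attention to the hand region" while the denoising term still covers the whole image.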