Robust Photo-Realistic Hand Gesture Generation: from Single View to Multiple View

📅 2025-05-14
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Single-view MANO rendering in high-fidelity hand gesture generation suffers from 3D topological information loss and difficulty modeling finger occlusions. Method: This paper proposes a multi-view prior-guided diffusion model, introducing MUFEN, the first six-directional (front/back/left/right/top/bottom) collaborative prior framework. It features a dual-stream UNet encoder and a bounding-box-aware multimodal feature fusion module to improve hand localization accuracy and occluded-region completion. Contribution/Results: Leveraging joint supervision from multi-view MANO meshes and fine-tuning of the diffusion model, the approach achieves state-of-the-art performance in geometric consistency, texture realism, and reconstruction of complex interdigitated gestures. It significantly enhances 3D hand structural integrity and visual fidelity, outperforming existing methods in both quantitative metrics and qualitative realism.
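The "most information-rich view combination" idea from the summary can be sketched as a selection over the six rendering directions. This is a hypothetical, pure-Python illustration: the visible-pixel-count scoring heuristic and the function names are assumptions, not the paper's actual selection criterion.

```python
from itertools import combinations

# Six canonical MANO rendering directions named in the summary.
VIEWS = ["front", "back", "left", "right", "top", "bottom"]

def visible_pixels(mask):
    """Count hand pixels in a binary render mask (rows of 0/1)."""
    return sum(sum(row) for row in mask)

def best_view_combo(masks, k=3):
    """Pick the k-view combination whose renders show the most hand pixels.

    `masks` maps view name -> binary mask; a simple stand-in for
    selecting an information-rich multi-view prior.
    """
    def score(combo):
        return sum(visible_pixels(masks[v]) for v in combo)
    return max(combinations(VIEWS, k), key=score)
```

For example, if the front, back, and top renders expose the most hand surface for a given pose, `best_view_combo` returns that triple as the training prior.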

πŸ“ Abstract
High-fidelity hand gesture generation represents a significant challenge in human-centric generation tasks. Existing methods typically employ single-view 3D MANO mesh-rendered images as priors to enhance gesture generation quality. However, the complexity of hand movements and the inherent limitations of single-view rendering make it difficult to capture complete 3D hand information, particularly when fingers are occluded. The fundamental contradiction lies in the loss of 3D topological relationships through 2D projection and the incomplete spatial coverage inherent to single-view representations. Diverging from single-view prior approaches, we propose a multi-view prior framework, named Multi-Modal UNet-based Feature Encoder (MUFEN), to guide diffusion models in learning comprehensive 3D hand information. Specifically, we extend conventional front-view rendering to include rear, left, right, top, and bottom perspectives, selecting the most information-rich view combination as training priors to address occlusion completion. This multi-view prior, together with a dedicated dual-stream encoder, significantly improves the model's understanding of complete hand features. Furthermore, we design a bounding-box feature fusion module that fuses gesture localization features with gesture multimodal features to enhance the location-awareness of MUFEN's gesture-related features. Experiments demonstrate that our method achieves state-of-the-art performance in both quantitative metrics and qualitative evaluations.
Problem

Research questions and friction points this paper is trying to address.

Generating high-fidelity hand gestures from single-view limitations
Addressing occlusion and incomplete 3D hand information in rendering
Improving gesture feature learning via multi-view prior framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view prior framework enhances 3D hand information
Dual-stream encoder improves understanding of complete hand features
Bounding box feature fusion module enhances location-awareness
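The bounding-box feature fusion bullet can be illustrated with a minimal sketch. This is a hypothetical interface, assuming normalized box coordinates and flat feature vectors; the paper's actual module operates on UNet feature maps, and the concatenation-based encoding here is an assumption for illustration only.

```python
def fuse_bbox_features(gesture_feat, bbox):
    """Fuse gesture features with hand-localization features.

    Hypothetical stand-in for the bounding-box feature fusion module:
    encodes a normalized box (x0, y0, x1, y1) as corners, center, and
    size, then concatenates it onto the gesture feature vector.
    """
    x0, y0, x1, y1 = bbox
    loc = [x0, y0, x1, y1,            # corners
           (x0 + x1) / 2, (y0 + y1) / 2,  # center
           x1 - x0, y1 - y0]          # width, height
    return list(gesture_feat) + loc
```

The fused vector carries both what the gesture looks like and where the hand sits in the frame, which is the location-awareness the bullet refers to.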
🔎 Similar Papers
No similar papers found.