UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches treat hand motion modeling as separate estimation and generation tasks, struggling to integrate heterogeneous conditional signals and suffering significant performance degradation under occlusion or missing inputs. This work proposes a unified latent diffusion framework that reframes both tasks as conditional synthesis problems. By leveraging a shared latent space, the model effectively fuses multimodal inputs, while a hand-aware encoder coupled with a frozen vision backbone directly extracts image features—eliminating the need for complex preprocessing. To our knowledge, this is the first method to unify hand motion estimation and generation, enabling cross-task knowledge transfer and supporting diverse heterogeneous conditioning signals. Experiments demonstrate robust performance across multiple benchmarks, maintaining high accuracy even under severe occlusion and temporally incomplete inputs.

📝 Abstract
Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.
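The pipeline the abstract describes — heterogeneous conditions embedded into a shared latent space, then a diffusion model synthesizing motion from whichever signals are present — can be illustrated with a toy sketch. Everything here is hypothetical: the dimensions, the linear "encoders" standing in for the joint VAE, and the blending rule standing in for a learned denoiser are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared latent dimension (illustrative)

# Hypothetical fixed projections standing in for the joint VAE encoders that
# map heterogeneous conditions (e.g. MANO parameters, 2D skeletons) into one
# shared latent space. In the real model these would be learned networks.
W_mano = rng.normal(size=(45, D)) * 0.1   # MANO pose params -> latent
W_skel = rng.normal(size=(42, D)) * 0.1   # 21 joints x 2D   -> latent

def encode_conditions(mano=None, skeleton=None):
    """Fuse whichever condition signals are present; a missing modality
    simply contributes nothing, mimicking robustness to absent inputs."""
    parts = []
    if mano is not None:
        parts.append(mano @ W_mano)
    if skeleton is not None:
        parts.append(skeleton.reshape(-1) @ W_skel)
    return np.mean(parts, axis=0) if parts else np.zeros(D)

def denoise_step(x_t, cond, t, T):
    """Toy reverse-diffusion step: blend the noisy latent toward the
    condition embedding (a stand-in for a learned denoiser)."""
    alpha = 1.0 - t / T
    return alpha * x_t + (1 - alpha) * cond

def synthesize(cond, T=50):
    x = rng.normal(size=D)          # start from Gaussian noise
    for t in range(T, 0, -1):       # iterate t = T .. 1
        x = denoise_step(x, cond, t, T)
    return x

# Both "estimation" (conditions derived from observation) and "generation"
# (conditions from structured inputs) reduce to the same synthesis call:
cond = encode_conditions(mano=rng.normal(size=45),
                         skeleton=rng.normal(size=(21, 2)))
motion_latent = synthesize(cond)
print(motion_latent.shape)  # (16,)
```

The point of the sketch is the unification: estimation and generation differ only in which conditions feed `encode_conditions`, while the denoising loop is shared — which is how, per the abstract, one model can handle occluded or temporally incomplete inputs.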
Problem

Research questions and friction points this paper is trying to address.

4D hand motion modeling
hand motion estimation
hand motion generation
occlusion robustness
heterogeneous conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified hand motion modeling
diffusion-based generation
heterogeneous condition fusion
hand perceptron
latent diffusion model
Zhihao Sun
Fudan University
Computer Vision
Tong Wu
Stanford University
Computer Vision · Deep Learning
Ruirui Tu
Institute of Trustworthy Embodied AI (TEAI), Fudan University
Daoguo Dong
Institute of Trustworthy Embodied AI (TEAI), Fudan University
Zuxuan Wu
Fudan University