UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches treat hand motion modeling as separate estimation and generation tasks, struggling to integrate heterogeneous conditional signals and suffering significant performance degradation under occlusion or missing inputs. This work proposes a unified latent diffusion framework that reframes both tasks as conditional synthesis problems. By leveraging a shared latent space, the model effectively fuses multimodal inputs, while a hand-aware encoder coupled with a frozen vision backbone directly extracts image features—eliminating the need for complex preprocessing. To our knowledge, this is the first method to unify hand motion estimation and generation, enabling cross-task knowledge transfer and supporting diverse heterogeneous conditioning signals. Experiments demonstrate robust performance across multiple benchmarks, maintaining high accuracy even under severe occlusion and temporally incomplete inputs.

📝 Abstract
Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.
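The pipeline the abstract describes — heterogeneous conditions embedded into a shared latent space, then a diffusion model synthesizing motion from whichever signals are present — can be illustrated with a toy sketch. Everything here is hypothetical: the dimensions, the linear "encoders" standing in for the joint VAE, and the blending rule standing in for a learned denoiser are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared latent dimension (illustrative)

# Hypothetical fixed projections standing in for the joint VAE encoders that
# map heterogeneous conditions (e.g. MANO parameters, 2D skeletons) into one
# shared latent space. In the real model these would be learned networks.
W_mano = rng.normal(size=(45, D)) * 0.1   # MANO pose params -> latent
W_skel = rng.normal(size=(42, D)) * 0.1   # 21 joints x 2D   -> latent

def encode_conditions(mano=None, skeleton=None):
    """Fuse whichever condition signals are present; a missing modality
    simply contributes nothing, mimicking robustness to absent inputs."""
    parts = []
    if mano is not None:
        parts.append(mano @ W_mano)
    if skeleton is not None:
        parts.append(skeleton.reshape(-1) @ W_skel)
    return np.mean(parts, axis=0) if parts else np.zeros(D)

def denoise_step(x_t, cond, t, T):
    """Toy reverse-diffusion step: blend the noisy latent toward the
    condition embedding (a stand-in for a learned denoiser)."""
    alpha = 1.0 - t / T
    return alpha * x_t + (1 - alpha) * cond

def synthesize(cond, T=50):
    x = rng.normal(size=D)          # start from Gaussian noise
    for t in range(T, 0, -1):       # iterate t = T .. 1
        x = denoise_step(x, cond, t, T)
    return x

# Both "estimation" (conditions derived from observation) and "generation"
# (conditions from structured inputs) reduce to the same synthesis call:
cond = encode_conditions(mano=rng.normal(size=45),
                         skeleton=rng.normal(size=(21, 2)))
motion_latent = synthesize(cond)
print(motion_latent.shape)  # (16,)
```

The point of the sketch is the unification: estimation and generation differ only in which conditions feed `encode_conditions`, while the denoising loop is shared — which is how, per the abstract, one model can handle occluded or temporally incomplete inputs.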
Problem

Research questions and friction points this paper is trying to address.

4D hand motion modeling
hand motion estimation
hand motion generation
occlusion robustness
heterogeneous conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified hand motion modeling
diffusion-based generation
heterogeneous condition fusion
hand perceptron
latent diffusion model
Zhihao Sun
Fudan University
Computer Vision
Tong Wu
Stanford University
Computer Vision · Deep Learning
Ruirui Tu
Institute of Trustworthy Embodied AI (TEAI), Fudan University
Daoguo Dong
Institute of Trustworthy Embodied AI (TEAI), Fudan University
Zuxuan Wu
Fudan University