🤖 AI Summary
To address the degradation in 3D hand pose estimation accuracy caused by severe occlusion (e.g., hand overlap) in interactive scenarios, this paper proposes a novel state-space model–based framework. Methodologically, it introduces a deformable state scanning mechanism to adaptively sample features around locally occluded joints and integrates Mamba's selective state modeling to efficiently fuse convolutional local features with global contextual cues—spanning inter-joint, inter-hand, and scene-level dependencies. This design overcomes the long-range dependency modeling limitations of conventional CNNs, significantly enhancing occlusion-resilient structural recovery. Experiments demonstrate state-of-the-art performance across five diverse benchmarks covering single-/dual-hand pose estimation, hand-object interaction, and RGB/depth multimodal settings. The method outperforms advanced backbone networks—including VMamba and Spatial-Mamba—in accuracy while maintaining inference efficiency comparable to ResNet-50.
📝 Abstract
Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joint, inter-hand, or scene-level dependencies). However, most current 3D HPE methods still rely on ResNet for feature extraction, and such a CNN's inductive bias may not be optimal for 3D HPE because of its limited ability to model global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE built on recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features obtained after convolution, our deformable scanning aggregates them across the image while selectively preserving cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE while keeping inference speed comparable to that of ResNet-50. Our experiments include extensive evaluations on five diverse datasets covering single-hand and two-hand scenarios, hand-only and hand-object interactions, and both RGB- and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.
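To make the two ingredients of the abstract concrete, here is a minimal NumPy sketch of the general idea: deformably sampling a convolutional feature map into a scan sequence, then running an input-dependent (selective) linear recurrence over it. All names (`sample_deformable`, `selective_scan`) and the specific gating form are hypothetical simplifications for illustration, not the paper's actual DF-Mamba implementation.

```python
# Toy sketch of "deformable scanning + selective state modeling" in NumPy.
# Assumptions: nearest-neighbour sampling (real deformable ops use bilinear
# interpolation for differentiability) and a simplified sigmoid-gated
# recurrence standing in for Mamba's selective SSM.
import numpy as np

rng = np.random.default_rng(0)

def sample_deformable(feat, offsets):
    """Gather features at grid positions shifted by learned offsets.

    feat:    (H, W, C) local feature map (e.g., after a convolution)
    offsets: (H, W, 2) per-position (dy, dx) displacements
    Returns a (H*W, C) sequence of deformably sampled features.
    """
    H, W, C = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    y = np.clip(np.round(ys + offsets[..., 0]).astype(int), 0, H - 1)
    x = np.clip(np.round(xs + offsets[..., 1]).astype(int), 0, W - 1)
    return feat[y, x].reshape(H * W, C)

def selective_scan(seq, w_a, w_b):
    """Input-dependent linear recurrence, the core idea behind selective
    state modeling: h_t = a_t * h_{t-1} + b_t * x_t, with gates a_t, b_t
    computed from the current input x_t itself.
    """
    T, C = seq.shape
    h = np.zeros(C)
    out = np.empty_like(seq)
    for t in range(T):
        a = 1.0 / (1.0 + np.exp(-(seq[t] @ w_a)))  # forget gate in (0, 1)
        b = 1.0 / (1.0 + np.exp(-(seq[t] @ w_b)))  # input gate in (0, 1)
        h = a * h + b * seq[t]                     # carry global context
        out[t] = h
    return out

H, W, C = 8, 8, 4
feat = rng.standard_normal((H, W, C))
offsets = rng.standard_normal((H, W, 2))   # would be predicted by a conv head
seq = sample_deformable(feat, offsets)     # (64, 4) scan sequence
out = selective_scan(seq,
                     rng.standard_normal((C, C)) * 0.1,
                     rng.standard_normal((C, C)) * 0.1)
print(out.shape)  # (64, 4)
```

Each output position thus mixes its deformably sampled local feature with state accumulated from earlier positions in the scan, which is how occluded joints can borrow cues from the rest of the image.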