🤖 AI Summary
This work addresses the challenging problem of hand mesh reconstruction from a single RGB image under severe occlusion (e.g., hand-object interaction). To this end, we introduce state space models (SSMs) to this task for the first time, proposing a state-space-driven spatial-channel joint attention module that simultaneously captures long-range spatial dependencies and channel-wise responses, thereby significantly enlarging the effective receptive field. Our method employs a lightweight encoder-decoder architecture with end-to-end differentiable mesh regression, balancing accuracy and efficiency. Evaluated on strong-occlusion benchmarks—including FREIHAND, DEXYCB, and HO3D—our approach achieves state-of-the-art performance in both quantitative metrics and qualitative fidelity. It attains the smallest parameter count and fastest inference speed among comparable methods, while consistently producing complete, geometrically accurate, and detail-rich hand reconstructions.
📝 Abstract
Reconstructing the hand mesh from a single RGB image is a challenging task because hands are often occluded by other objects. Most previous works exploit additional information and adopt attention mechanisms to improve 3D reconstruction performance, but this simultaneously increases computational complexity. To achieve a performance-preserving architecture with high computational efficiency, in this work we propose a simple but effective 3D hand mesh reconstruction network (i.e., HandS3C), which is the first to incorporate a state space model into the task of hand mesh reconstruction. In the network, we design a novel state-space spatial-channel attention module that extends the effective receptive field, extracts hand features along the spatial dimension, and enhances regional features of hands along the channel dimension. This helps to reconstruct a complete and detailed hand mesh. Extensive experiments conducted on well-known datasets featuring heavy occlusions (such as FREIHAND, DEXYCB, and HO3D) demonstrate that our proposed HandS3C achieves state-of-the-art performance while maintaining a minimal parameter count.
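To give a concrete sense of what joint spatial-channel attention means, here is a minimal, illustrative NumPy sketch. It is not the actual HandS3C module (which builds its attention on state space models); it only shows the general pattern of reweighting a feature map with a spatial attention map and a channel gate, plus a residual connection. All function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_channel_attention(feat):
    """Toy spatial-channel joint attention (hypothetical, not the paper's SSM module).

    feat: (C, H, W) feature map.
    Spatial branch: one attention weight per location, from the
    channel-averaged response. Channel branch: one sigmoid gate per
    channel, from the spatially pooled response. The two reweightings
    are applied jointly, followed by a residual connection.
    """
    C, H, W = feat.shape
    x = feat.reshape(C, H * W)                        # (C, N) with N = H*W
    spatial = softmax(x.mean(axis=0))                 # (N,) location weights
    channel = 1.0 / (1.0 + np.exp(-x.mean(axis=1)))  # (C,) channel gates
    out = (x * spatial[None, :] * channel[:, None]).reshape(C, H, W)
    return feat + out                                 # residual connection

feat = np.random.default_rng(0).normal(size=(8, 4, 4))
out = spatial_channel_attention(feat)
print(out.shape)  # (8, 4, 4)
```

The residual form means the module can only add occlusion-aware emphasis on top of the backbone features rather than replace them, which is the usual design choice for attention blocks inserted into an encoder-decoder.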