FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding

📅 2025-10-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the high computational cost and redundancy of existing Transformer-based 3D human mesh recovery (HMR) models, this paper proposes an efficient, lightweight framework. First, it introduces an error-constrained layer merging mechanism and a mask-guided image token merging strategy that together reduce model depth and sequence length. Second, it adds a diffusion-based temporal decoder that draws on human motion priors learned from large-scale motion capture data to compensate for the accuracy lost to merging. The method preserves pose priors and contextual modeling capability while achieving up to 2.3× inference speed-up, and it slightly outperforms the baseline in MPJPE on multiple benchmarks. The core contribution is the first integration of joint layer-and-token compression with diffusion-based decoding for HMR, balancing efficiency and accuracy.

📝 Abstract
Recent transformer-based models for 3D Human Mesh Recovery (HMR) have achieved strong performance but often suffer from high computational cost and complexity due to deep transformer architectures and redundant tokens. In this paper, we introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM selectively merges transformer layers that have minimal impact on the Mean Per Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background tokens that contribute little to the final prediction. To further address the potential performance drop caused by merging, we propose a diffusion-based decoder that incorporates temporal context and leverages pose priors learned from large-scale motion capture datasets. Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline.
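Both the layer-merging criterion and the reported accuracy are stated in terms of MPJPE. As a minimal sketch of that metric, assuming joints are given as 3D coordinates in a shared frame and unit, it is the mean Euclidean distance between predicted and ground-truth joints. The function name `mpjpe` and the toy data are illustrative and not taken from the paper's code.

```python
import math


def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance per joint.

    pred, gt: equal-length sequences of 3D joint coordinates (x, y, z).
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    # math.dist gives the Euclidean distance between two points.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy data: both predicted joints are exactly 5 units from their ground truth.
gt = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pred = [[3.0, 4.0, 0.0], [0.0, 3.0, 4.0]]
print(mpjpe(pred, gt))  # 5.0
```

In practice the metric is usually averaged over all frames and samples of a benchmark, which this sketch leaves out.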

Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost of 3D human mesh recovery models
Merges redundant transformer layers and background tokens
Maintains accuracy using diffusion decoding with temporal context
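The idea of merging layers only while the error stays acceptable can be illustrated with a greedy loop. This is a hypothetical sketch, not the paper's ECLM procedure: `merge`, `evaluate`, and `budget` are assumed caller-supplied pieces, and the toy "layers" are plain numbers rather than transformer blocks.

```python
def greedy_layer_merge(layers, merge, evaluate, budget):
    """Greedily merge adjacent layers while the error increase stays within budget.

    layers:   list of layer objects (any type understood by merge/evaluate)
    merge:    function (layer_a, layer_b) -> merged layer        (assumed)
    evaluate: function (list of layers) -> error, e.g. MPJPE     (assumed)
    budget:   largest tolerated error increase over the unmerged model
    """
    layers = list(layers)
    base_error = evaluate(layers)
    while len(layers) > 1:
        best = None
        # Try every adjacent pair and remember the least harmful merge.
        for i in range(len(layers) - 1):
            candidate = layers[:i] + [merge(layers[i], layers[i + 1])] + layers[i + 2:]
            error = evaluate(candidate)
            if best is None or error < best[0]:
                best = (error, candidate)
        if best[0] - base_error > budget:
            break  # even the best merge would exceed the error budget
        layers = best[1]
    return layers


# Toy stand-in: each "layer" adds a constant, so the model output is the sum.
target = 2.0
evaluate = lambda ls: abs(sum(ls) - target)  # stand-in for validation MPJPE
merge = lambda a, b: (a + b) / 2             # stand-in for parameter averaging
print(greedy_layer_merge([0.0, 0.0, 1.0, 1.0], merge, evaluate, budget=0.1))
# [0.0, 1.0, 1.0]
```

Only the two zero-valued layers are merged here, because any further merge would move the output away from the target by more than the budget.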

Innovation

Methods, ideas, or system contributions that make the work stand out.

Merges transformer layers with minimal error impact
Combines redundant background tokens using mask guidance
Employs diffusion decoder with temporal context integration
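Mask-guided merging of background tokens can be sketched as follows. This is a simplified stand-in and not the paper's Mask-ToMe: it assumes a per-token boolean person mask is already available, and it averages background tokens in fixed-size groups instead of matching them by similarity. The function name and the `group` parameter are hypothetical.

```python
def merge_background_tokens(tokens, is_person, group=2):
    """Keep person tokens untouched and average background tokens in groups.

    tokens:    list of feature vectors (lists of floats), one per image patch
    is_person: list of bools, True where the patch overlaps the person mask
    group:     number of background tokens averaged into one (assumed knob)
    """
    if len(tokens) != len(is_person):
        raise ValueError("tokens and is_person must have the same length")
    if group < 1:
        raise ValueError("group must be at least 1")
    kept = [list(t) for t, m in zip(tokens, is_person) if m]
    background = [t for t, m in zip(tokens, is_person) if not m]
    merged = []
    for start in range(0, len(background), group):
        chunk = background[start:start + group]
        # Element-wise mean over the tokens in this chunk.
        merged.append([sum(col) / len(chunk) for col in zip(*chunk)])
    # Note: person tokens come first, so the original token order is not kept.
    return kept + merged


tokens = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0], [5.0, 5.0]]
is_person = [True, False, False, True]
print(merge_background_tokens(tokens, is_person))
# [[1.0, 1.0], [5.0, 5.0], [1.0, 1.0]]
```

The sequence shrinks from four tokens to three while every person token is preserved, which is the property the mask guidance is meant to protect.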