FastHMR: Accelerating Human Mesh Recovery via Token and Layer Merging with Diffusion Decoding

📅 2025-10-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To reduce the high computational cost and redundancy of existing Transformer-based 3D human mesh recovery (HMR) models, this paper proposes a more efficient framework. First, it introduces error-constrained layer merging and mask-guided image token merging, which reduce model depth and sequence length, respectively. Second, it adds a diffusion-based temporal decoder that draws on pose priors learned from large-scale motion capture data to compensate for the accuracy lost to merging. The method achieves up to 2.3× inference speedup while slightly improving on the baseline across multiple benchmarks. The core contribution is the combination of layer and token merging with diffusion-based decoding for HMR, balancing efficiency and accuracy.

📝 Abstract

Recent transformer-based models for 3D Human Mesh Recovery (HMR) have achieved strong performance but often suffer from high computational cost and complexity due to deep transformer architectures and redundant tokens. In this paper, we introduce two HMR-specific merging strategies: Error-Constrained Layer Merging (ECLM) and Mask-guided Token Merging (Mask-ToMe). ECLM selectively merges transformer layers that have minimal impact on the Mean Per Joint Position Error (MPJPE), while Mask-ToMe focuses on merging background tokens that contribute little to the final prediction. To further address the potential performance drop caused by merging, we propose a diffusion-based decoder that incorporates temporal context and leverages pose priors learned from large-scale motion capture datasets. Experiments across multiple benchmarks demonstrate that our method achieves up to 2.3x speed-up while slightly improving performance over the baseline.
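The abstract does not spell out how Mask-ToMe merges tokens, so the following is only a minimal sketch of the general idea, assuming a ToMe-style bipartite matching that is restricted to background tokens. The function name `mask_guided_merge`, the parameter `r`, and the choice of cosine similarity with averaging are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mask_guided_merge(tokens, fg_mask, r):
    """Merge up to r background tokens into similar background tokens.

    tokens:  (N, D) array of token features.
    fg_mask: (N,) boolean array, True where a token overlaps the person.
    r:       number of background tokens to remove (hypothetical parameter).
    Returns foreground tokens first, then the remaining background tokens.
    """
    tokens = np.asarray(tokens, dtype=float)
    fg = tokens[fg_mask]
    bg = tokens[~fg_mask]
    # Split background tokens into two alternating sets (bipartite matching).
    a, b = bg[0::2], bg[1::2]
    r = min(r, len(a)) if len(b) > 0 else 0
    if r == 0:
        return np.concatenate([fg, bg], axis=0)
    # Cosine similarity between every token in A and every token in B.
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a_n @ b_n.T
    best_match = sim.argmax(axis=1)
    best_score = sim.max(axis=1)
    # The r tokens in A with the most similar partner are merged away.
    order = np.argsort(-best_score)
    merged_idx, kept_idx = order[:r], order[r:]
    # Average each merged token into its partner, tracking group sizes.
    sums = b.copy()
    counts = np.ones(len(b))
    for i in merged_idx:
        sums[best_match[i]] += a[i]
        counts[best_match[i]] += 1
    b_merged = sums / counts[:, None]
    return np.concatenate([fg, a[kept_idx], b_merged], axis=0)
```

In this sketch the foreground tokens pass through untouched, so the sequence shortens only where the mask marks background. A real model would also need to carry token sizes and positions forward, which is omitted here.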

Problem

Research questions and friction points this paper is trying to address.

High computational cost of transformer-based 3D human mesh recovery models

Redundant transformer layers and background tokens that add little to the prediction

Accuracy loss caused by merging, countered with diffusion decoding that uses temporal context

Innovation

Methods, ideas, or system contributions that make the work stand out.

Merges transformer layers with minimal error impact

Combines redundant background tokens using mask guidance

Employs diffusion decoder with temporal context integration
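As a rough illustration of merging layers under an error constraint, the sketch below greedily merges adjacent layers of a toy residual stack while the MPJPE increase stays within a budget. Everything here is an assumption made for illustration: the toy `forward` model, the merge rule (summing the residual weights of two adjacent layers), the greedy search, and the `max_increase` parameter. The paper's actual ECLM procedure may differ.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: mean Euclidean distance over joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def forward(layers, x):
    # Toy residual stack: each "layer" is a weight matrix W applied as x + x @ W.
    for w in layers:
        x = x + x @ w
    return x

def greedy_layer_merge(layers, x, gt, max_increase):
    """Greedily merge adjacent layers while MPJPE stays within the budget."""
    layers = list(layers)
    base = mpjpe(forward(layers, x), gt)
    while len(layers) > 1:
        best = None
        for i in range(len(layers) - 1):
            # Assumed merge rule: sum the residual weights of two adjacent layers.
            cand = layers[:i] + [layers[i] + layers[i + 1]] + layers[i + 2:]
            err = mpjpe(forward(cand, x), gt)
            if best is None or err < best[0]:
                best = (err, cand)
        # Stop once even the least harmful merge exceeds the error budget.
        if best[0] - base > max_increase:
            break
        layers = best[1]
    return layers
```

The budget is measured against the unmerged model's error, so the returned stack never exceeds the original MPJPE by more than `max_increase` on the calibration data `x`, `gt`.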

🔎 Similar Papers

DiffMesh: A Motion-aware Diffusion Framework for Human Mesh Recovery from Videos