MEGA: Masked Generative Autoencoder for Human Mesh Recovery

📅 2024-05-29

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the severe ambiguity inherent in human mesh recovery (HMR) from a single RGB image, this paper proposes MEGA, a framework based on masked generative modeling. Human pose and shape are discretized into a token sequence, and a Transformer conditioned on the input image is trained as a masked autoencoder to reconstruct that sequence. The same model supports both deterministic single-mesh prediction and stochastic multi-sample generation. On challenging in-the-wild benchmarks (e.g., 3DPW, AGORA), the paper reports state-of-the-art performance: the deterministic mode outperforms prior single-output approaches, and the stochastic mode outperforms existing multi-output models.

📝 Abstract

Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as an infinite set of 3D interpretations can explain the 2D observation equally well. Nevertheless, most HMR methods overlook this issue and make a single prediction without accounting for this ambiguity. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.
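The generation scheme described in the abstract can be illustrated with a small sketch of iterative masked-token decoding. This is not the authors' implementation: the `toy_logits` function stands in for the image-conditioned model, and the `MASK` id, the sequence length, the vocabulary size, the number of steps, and the confidence-based reveal schedule are all assumptions chosen for illustration. The sketch only shows how one decoding loop can yield a single prediction (argmax at every step) or multiple samples (sampling from the predicted distribution).

```python
import numpy as np

MASK = -1  # hypothetical id marking a token slot that is still masked


def toy_logits(tokens, image_feat, vocab_size):
    """Stand-in for an image-conditioned token predictor (assumption).

    Returns logits of shape (seq_len, vocab_size). A real model would also
    attend to the already revealed tokens; this toy version ignores them.
    """
    pos = np.arange(len(tokens))[:, None]
    vocab = np.arange(vocab_size)[None, :]
    return np.cos(image_feat * (pos + 1) * (vocab + 1))


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def generate(image_feat, seq_len=8, vocab_size=16, steps=4,
             stochastic=False, rng=None):
    """Fill a fully masked token sequence over a fixed number of steps."""
    if rng is None:
        rng = np.random.default_rng()
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = softmax(toy_logits(tokens, image_feat, vocab_size))
        if stochastic:
            # Stochastic mode: sample one token per position.
            proposal = np.array([rng.choice(vocab_size, p=p) for p in probs])
        else:
            # Deterministic mode: take the most likely token per position.
            proposal = probs.argmax(axis=-1)
        conf = probs[np.arange(seq_len), proposal]
        # Reveal an even share of the remaining masked slots, most confident first.
        n_reveal = int(np.ceil(masked.size / (steps - step)))
        chosen = masked[np.argsort(-conf[masked])][:n_reveal]
        tokens[chosen] = proposal[chosen]
    return tokens


single = generate(0.7)  # one deterministic token sequence
samples = [generate(0.7, stochastic=True, rng=np.random.default_rng(s))
           for s in range(3)]  # several sampled token sequences
```

In a full pipeline, each resulting token sequence would be decoded back into pose and shape parameters by the tokenizer's decoder; that step is omitted here.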

Problem

Research questions and friction points this paper is trying to address.

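Single-image HMR is highly ambiguous: many 3D meshes explain the same 2D observation equally well

Most HMR methods make a single prediction and do not account for this ambiguity

Existing multi-output methods are not competitive with the latest single-output models when making a single prediction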

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked generative modeling for human mesh recovery

Tokenization of human pose and shape

Flexible deterministic and stochastic mesh generation
