🤖 AI Summary
To address the ambiguity inherent in human mesh recovery (HMR) from a single RGB image, this paper proposes MEGA, an approach based on masked generative modeling. Human pose and shape are discretized into a sequence of tokens, and HMR is cast as generating that token sequence conditioned on the input image, with a masked generative autoencoder trained to recover the full mesh from the image and a partially masked token sequence. The same model supports two generation modes: a deterministic mode that yields a single prediction and a stochastic mode that samples multiple plausible meshes. On in-the-wild benchmarks, the paper reports state-of-the-art results in both modes: the deterministic mode outperforms single-output approaches, and the stochastic mode outperforms multi-output approaches.
📝 Abstract
Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as an infinite set of 3D interpretations can explain the 2D observation equally well. Nevertheless, most HMR methods overlook this issue and make a single prediction without accounting for this ambiguity. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.
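The abstract does not spell out the decoding procedure, so the following is only a minimal sketch of how generic masked generative decoding can switch between a deterministic and a stochastic mode. Everything in it is an assumption made for illustration: `toy_logits` is a hypothetical stand-in for the image-conditioned model, `MASK`, `generate`, and the confidence-based unmasking schedule are illustrative choices, and none of it is taken from MEGA's actual implementation. Decoding the resulting tokens back into a mesh is not shown.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a masked token


def toy_logits(image_seed, tokens, vocab_size):
    """Hypothetical stand-in for an image-conditioned network.

    Returns one row of logits per token position. The values depend on the
    "image" (a float here) and on how many tokens are already revealed.
    """
    known = sum(t != MASK for t in tokens)
    return [
        [math.sin(image_seed + 3.1 * i + 1.7 * v + 0.5 * known)
         for v in range(vocab_size)]
        for i in range(len(tokens))
    ]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def generate(image_seed, seq_len=8, vocab_size=16, steps=4,
             stochastic=False, rng=None):
    """Iteratively fill a fully masked token sequence.

    Deterministic mode takes the argmax token at each masked position;
    stochastic mode samples from the predicted distribution instead.
    """
    rng = rng or random.Random()
    tokens = [MASK] * seq_len
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        probs = [softmax(r) for r in toy_logits(image_seed, tokens, vocab_size)]

        # Propose a token (and its probability) for every masked position.
        proposals = {}
        for i in masked:
            if stochastic:
                tok = rng.choices(range(vocab_size), weights=probs[i])[0]
            else:
                tok = max(range(vocab_size), key=probs[i].__getitem__)
            proposals[i] = (tok, probs[i][tok])

        # Reveal the most confident proposals, spreading the remaining
        # masked positions evenly over the remaining steps.
        n_keep = math.ceil(len(masked) / (steps - step))
        keep = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:n_keep]
        for i in keep:
            tokens[i] = proposals[i][0]
    return tokens
```

Under these assumptions, calling `generate(0.3)` twice returns the same token sequence, while repeated calls with `stochastic=True` and differently seeded `rng` objects can return different sequences for the same input, which is the single-prediction versus multi-sample behavior the abstract describes.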