🤖 AI Summary
This work addresses three core challenges in 3D autoregressive generation: (1) the inherent disorder of 3D data conflicting with sequential modeling assumptions, (2) distortion introduced by vector quantization during grid-based compression, and (3) inefficiency in scaling latent representations to high resolutions. To this end, we propose a synergistic architecture comprising a Pyramid Variational Autoencoder (Pyramid VAE) and a Cascaded Masked Autoregressive Transformer (Cascaded MAR). We introduce a novel randomized masking training scheme coupled with an unordered autoregressive denoising mechanism to explicitly encode the permutation invariance of 3D structures. Additionally, we design a cascaded training strategy with condition augmentation that enables efficient, progressive latent-space upsampling. Evaluated on multiple 3D generation benchmarks, our method significantly outperforms state-of-the-art approaches, including diffusion-based transformers, achieving superior generation fidelity, improved generalization across object categories, and enhanced scalability to high-resolution outputs.
📝 Abstract
Recent advances in auto-regressive transformers have revolutionized generative modeling across different domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with the sequential next-token prediction paradigm; conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes; and efficient scaling strategies for higher-resolution latent prediction are lacking. To address these challenges, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in continuous space. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that efficiently up-scales the latent token resolution with fast convergence. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization compared to existing methods but also exhibits enhanced scaling capabilities compared to joint distribution modeling approaches (e.g., diffusion transformers).
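To make the random-masking idea concrete, the following is a minimal toy sketch of one masked-prediction step over an unordered set of continuous latent tokens. This is our own illustrative code, not the paper's implementation: the mean-of-visible-tokens predictor and plain MSE loss are stand-ins for the actual transformer and per-token denoising objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask_step(tokens, mask_ratio):
    """One toy training step of random-masked prediction.

    tokens: (N, D) array of continuous latents, e.g. from a VAE encoder.
    Because the subset of masked tokens is drawn from a fresh random
    permutation each step, no fixed sequential order is imposed on the
    3D tokens -- the model only ever sees "predict these masked tokens
    given those visible ones".
    """
    n = tokens.shape[0]
    n_mask = max(1, int(n * mask_ratio))
    perm = rng.permutation(n)                 # random order, new each step
    masked_idx, visible_idx = perm[:n_mask], perm[n_mask:]
    # Stand-in predictor: broadcast the mean of visible tokens.
    # (A real model would run a transformer over visible tokens
    # plus mask embeddings and denoise each masked token.)
    pred = np.tile(tokens[visible_idx].mean(axis=0), (n_mask, 1))
    loss = float(((pred - tokens[masked_idx]) ** 2).mean())
    return masked_idx, loss
```

At inference, the same machinery can be run in reverse: start with all tokens masked and repeatedly unmask a random subset per step, which is the "auto-regressive denoising in random order" described above.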