MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three core challenges in 3D autoregressive generation: (1) the inherent disorder of 3D data conflicting with sequential modeling assumptions, (2) distortion introduced by vector quantization during grid-based compression, and (3) inefficiency in scaling latent representations to high resolutions. To this end, we propose a synergistic architecture comprising a Pyramid Variational Autoencoder (Pyramid VAE) and a Cascaded Masked Autoregressive Transformer (Cascaded MAR). We introduce a novel randomized masking training scheme coupled with an unordered autoregressive denoising mechanism to explicitly encode permutation invariance of 3D structures. Additionally, we design a conditional enhancement cascaded training strategy enabling efficient, progressive latent-space upsampling. Evaluated on multiple 3D generation benchmarks, our method significantly outperforms state-of-the-art approaches—including diffusion-based Transformers—achieving superior generation fidelity, improved generalization across object categories, and enhanced scalability to high-resolution outputs.

📝 Abstract
Recent advances in auto-regressive transformers have revolutionized generative modeling across domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with the sequential next-token prediction paradigm, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and efficient scaling strategies for higher-resolution latent prediction are lacking. To address these challenges, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in continuous space. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficient up-scaling of the latent token resolution with fast convergence. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization compared to existing methods but also exhibits enhanced scaling capabilities compared to joint distribution modeling approaches (e.g., diffusion transformers).
Problem

Research questions and friction points this paper is trying to address.

Unordered 3D data conflicts with sequential prediction
Compression loss in 3D mesh vector quantization
Lack of scaling strategies for high-resolution latent prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pyramid variational autoencoder for 3D generation
Cascaded masked auto-regressive transformer for upscaling
Random masking and denoising for unordered 3D data
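The random-masking idea above can be sketched in a few lines: instead of predicting tokens left to right, generation visits latent positions in a random order, unmasking a subset at each step conditioned on everything generated so far. This is a minimal illustrative sketch, not the paper's implementation; the `denoise_fn` callback stands in for the Cascaded MAR transformer's prediction step, and the step count and chunking scheme are assumptions.

```python
import numpy as np

def masked_ar_generate(num_tokens, num_steps, denoise_fn, rng=None):
    """Illustrative random-order masked auto-regressive sampling.

    At each step a random subset of still-masked token positions is
    chosen and filled in by `denoise_fn`, conditioned on the tokens
    unmasked so far. Because the visiting order is random rather than
    left-to-right, no sequential ordering is imposed on the tokens,
    matching the permutation invariance of 3D latent sets.
    """
    rng = rng or np.random.default_rng()
    tokens = np.full(num_tokens, np.nan)   # NaN marks masked positions
    order = list(range(num_tokens))
    rng.shuffle(order)                     # random generation order
    # Split the shuffled positions into `num_steps` unmasking rounds.
    for chunk in np.array_split(np.array(order), num_steps):
        # Predict the chosen positions given everything unmasked so far.
        tokens[chunk] = denoise_fn(tokens, chunk)
    return tokens

# Toy denoiser: returns each position index as its "predicted" value.
demo = masked_ar_generate(8, 3, lambda toks, idx: idx.astype(float))
```

After the final round every position has been visited exactly once, so `demo` contains no masked (NaN) entries regardless of the random order chosen.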