🤖 AI Summary
In conditional 3D shape generation, diffusion-based methods model continuous latent spaces well and autoregressive (AR) models excel at capturing inter-token dependencies, but existing approaches struggle to combine these strengths while also unifying diverse 3D representations (e.g., signed distance fields, point clouds, meshes, 3D Gaussian Splatting) and multimodal conditions (e.g., images, text). To address this, the paper proposes LTM3D, a unified framework that couples diffusion processes with AR-style dependency modeling in a shared latent token space. Its core components are: (1) a Conditional Distribution Modeling backbone, built on a masked autoencoder and a latent diffusion model, that strengthens token dependency learning; (2) Prefix Learning, which aligns condition tokens with shape latent tokens during generation; and (3) a Latent Token Reconstruction module with Reconstruction-Guided Sampling that reduces sampling uncertainty and improves structural fidelity. Experiments on image- and text-conditioned shape generation show that LTM3D outperforms existing methods in prompt fidelity and geometric accuracy, supports cross-representation and cross-modal synthesis, and generalizes across diverse settings.
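Prefix Learning, as described above, prepends condition tokens to the shape latent tokens so the generator can attend to the condition while producing shape tokens. The paper does not spell out the attention pattern, but a common way to realize such a prefix-conditioned sequence is a "prefix-LM" mask: condition tokens attend to each other bidirectionally, while shape tokens attend to the full prefix and, causally, to earlier shape tokens. The sketch below illustrates that mask construction only; `prefix_attention_mask` and its layout are illustrative assumptions, not LTM3D's actual code.

```python
import numpy as np

def prefix_attention_mask(n_prefix: int, n_shape: int) -> np.ndarray:
    """Illustrative prefix-conditioned attention mask (True = attend).

    Rows are queries, columns are keys. The first `n_prefix` positions
    hold condition tokens; the remaining `n_shape` positions hold shape
    latent tokens. This is a generic prefix-LM pattern, assumed here as
    one plausible realization of prefix-style conditioning.
    """
    n = n_prefix + n_shape
    # Causal baseline: every token may attend to itself and earlier tokens,
    # so shape tokens automatically see the entire prefix.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Condition (prefix) tokens attend to each other bidirectionally.
    mask[:n_prefix, :n_prefix] = True
    return mask
```

Because the prefix indices precede all shape indices, the lower-triangular baseline already lets every shape token attend to the whole condition prefix while blocking prefix tokens from peeking at shape tokens.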
📝 Abstract
We present LTM3D, a Latent Token space Modeling framework for conditional 3D shape generation that integrates the strengths of diffusion and auto-regressive (AR) models. While diffusion-based methods effectively model continuous latent spaces and AR models excel at capturing inter-token dependencies, combining these paradigms for 3D shape generation remains a challenge. To address this, LTM3D features a Conditional Distribution Modeling backbone, leveraging a masked autoencoder and a diffusion model to enhance token dependency learning. Additionally, we introduce Prefix Learning, which aligns condition tokens with shape latent tokens during generation, improving flexibility across modalities. We further propose a Latent Token Reconstruction module with Reconstruction-Guided Sampling to reduce uncertainty and enhance structural fidelity in generated shapes. Our approach operates in token space, enabling support for multiple 3D representations, including signed distance fields, point clouds, meshes, and 3D Gaussian Splatting. Extensive experiments on image- and text-conditioned shape generation tasks demonstrate that LTM3D outperforms existing methods in prompt fidelity and structural accuracy while offering a generalizable framework for multi-modal, multi-representation 3D generation.
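The abstract describes Reconstruction-Guided Sampling as using the Latent Token Reconstruction module to reduce uncertainty during generation. One simple way to picture this is a sampling step whose output is interpolated toward a reconstruction of the latent tokens, trading sample diversity for structural fidelity. The sketch below is a toy illustration of that idea under stated assumptions: `denoise`, `reconstruct`, and the `guidance` blending weight are hypothetical stand-ins, not the paper's actual algorithm.

```python
import numpy as np

def reconstruction_guided_step(x_t, denoise, reconstruct, guidance=0.5):
    """Toy reconstruction-guided update for one sampling step.

    `denoise` stands in for a learned diffusion denoiser and
    `reconstruct` for a latent-token reconstruction module (both
    hypothetical interfaces). Blending the denoised estimate toward
    the reconstruction damps sampling variance, which is the intuition
    behind guiding generation with a reconstruction signal.
    """
    x_denoised = denoise(x_t)   # diffusion model's estimate of the tokens
    x_recon = reconstruct(x_t)  # reconstruction module's estimate
    # Convex combination: guidance=0 ignores the reconstruction,
    # guidance=1 follows it entirely.
    return (1.0 - guidance) * x_denoised + guidance * x_recon
```

In a full sampler this update would be applied at every diffusion timestep; here it only demonstrates how a reconstruction term can steer a single step.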