PnP-U3D: Plug-and-Play 3D Framework Bridging Autoregression and Diffusion for Unified Understanding and Generation

📅 2026-02-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing unified 3D understanding-and-generation approaches force all tasks into a single autoregressive (AR) paradigm, which leads to performance degradation, quantization artifacts, and high training costs. This work proposes the first unified 3D framework that synergistically integrates AR and diffusion mechanisms: AR modeling handles 3D understanding tasks, while continuous diffusion enables high-fidelity 3D generation. A lightweight Transformer bridges features from large language models into the conditional space of the 3D diffusion model, enabling efficient cross-modal alignment. By avoiding the performance penalties of architectural homogenization and reusing pretrained models to reduce computational overhead, the method achieves state-of-the-art results across multiple benchmarks for 3D understanding, generation, and editing, demonstrating the promise of an AR–diffusion collaborative paradigm for building general-purpose 3D intelligent systems.

πŸ“ Abstract
The rapid progress of large multimodal models has inspired efforts toward unified frameworks that couple understanding and generation. While such paradigms have shown remarkable success in 2D, extending them to 3D remains largely underexplored. Existing attempts to unify 3D tasks under a single autoregressive (AR) paradigm lead to significant performance degradation due to forced signal quantization and prohibitive training cost. Our key insight is that the essential challenge lies not in enforcing a unified autoregressive paradigm, but in enabling effective information interaction between generation and understanding while minimally compromising their inherent capabilities and leveraging pretrained models to reduce training cost. Guided by this perspective, we present the first unified framework for 3D understanding and generation that combines autoregression with diffusion. Specifically, we adopt an autoregressive next-token prediction paradigm for 3D understanding, and a continuous diffusion paradigm for 3D generation. A lightweight transformer bridges the feature space of large language models and the conditional space of 3D diffusion models, enabling effective cross-modal information exchange while preserving the priors learned by standalone models. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across diverse 3D understanding and generation benchmarks, while also excelling in 3D editing tasks. These results highlight the potential of unified AR+diffusion models as a promising direction for building more general-purpose 3D intelligence.
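The abstract's central component is a lightweight transformer that maps the feature space of a large language model into the conditional space of a 3D diffusion model. A minimal sketch of that bridging idea is below, in PyTorch; all names and dimensions (`BridgeTransformer`, `llm_dim=4096`, `cond_dim=1024`, 77 query tokens, 4 layers) are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class BridgeTransformer(nn.Module):
    """Hypothetical lightweight bridge: cross-attends a set of learned
    query tokens over LLM hidden states, producing a fixed number of
    conditioning tokens for a (not shown) 3D diffusion model."""

    def __init__(self, llm_dim=4096, cond_dim=1024, num_queries=77,
                 num_layers=4, num_heads=8):
        super().__init__()
        # Learned queries that become the diffusion condition tokens.
        self.queries = nn.Parameter(torch.randn(1, num_queries, cond_dim) * 0.02)
        # Project LLM hidden states down to the condition width.
        self.in_proj = nn.Linear(llm_dim, cond_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=cond_dim, nhead=num_heads, dim_feedforward=4 * cond_dim,
            batch_first=True, norm_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, llm_hidden):
        # llm_hidden: (batch, seq_len, llm_dim) from the frozen LLM.
        memory = self.in_proj(llm_hidden)                  # (B, T, cond_dim)
        q = self.queries.expand(llm_hidden.size(0), -1, -1)
        return self.decoder(q, memory)                     # (B, num_queries, cond_dim)


bridge = BridgeTransformer()
cond = bridge(torch.randn(2, 128, 4096))
print(cond.shape)  # torch.Size([2, 77, 1024])
```

Because only this small module is trained while the LLM and diffusion model keep their pretrained weights, the design matches the abstract's goal of cross-modal information exchange without retraining either standalone model.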
Problem

Research questions and friction points this paper is trying to address.

3D understanding
3D generation
unified framework
autoregression
diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

autoregression
diffusion
3D generation
3D understanding
unified framework