SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autoregressive 3D generation methods lack an efficient voxel tokenization: at high resolution they produce either excessively long sequences or sequences with ambiguous spatial ordering. This work proposes adaptive supervoxel partitioning guided by geometric saliency, combining saliency prediction with centroidal Voronoi tessellation to obtain a shape-aware partition with a deterministic spatial order. The partition shortens the token sequence substantially while preserving structural detail. Combining a supervoxel-based VAE with a fine-tuned multimodal large language model, the method reduces sequence length to 12.8% of that of uniform voxel tokenization on Trellis-500K, while achieving state-of-the-art generation quality and an average 10× speedup over prior methods.
📝 Abstract
Autoregressive multimodal large language models (MLLMs) enable 3D generation but struggle to scale to high-resolution shapes due to inadequate 3D tokenizations. Compact set-based representations discard deterministic spatial ordering, leading to ambiguous sequence prediction, while uniform or octree-based voxel grids preserve ordering at the cost of severe redundancy and excessively long sequences. This structural trade-off limits stable and efficient autoregressive 3D generation. We present SuperVoxelGPT, a representation-first framework that resolves this tension through adaptive and deterministically ordered supervoxel tokenization. Given a prompt, we first predict a coarse geometric saliency distribution and construct a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, allocating fine-grained cells to complex regions and larger cells to smooth regions. Conditioned on the text and ordered supervoxel layout, we introduce a SuperVoxelVAE and fine-tune a pretrained MLLM to autoregressively generate supervoxel tokens. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10× speedup over prior methods.
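To make the idea of a saliency-guided partition with a fixed token order concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not say how the tessellation is solved or how cells are ordered, so the weighted Lloyd iteration, the Morton (Z-order) sort of cell sites, the function names `saliency_cvt`, `morton_key`, and `ordered_cells`, and the synthetic saliency grid are all assumptions made for illustration.

```python
import numpy as np

def saliency_cvt(saliency, n_cells, n_iters=20, seed=0):
    """Approximate a centroidal Voronoi tessellation on a voxel grid.

    Assumption: plain weighted Lloyd iterations, with saliency as the density.
    """
    rng = np.random.default_rng(seed)
    mask = saliency > 0
    coords = np.argwhere(mask).astype(np.float64)  # (N, 3) voxel coordinates
    weights = saliency[mask].astype(np.float64)    # (N,) matching saliency values

    # Draw the initial sites in proportion to saliency.
    probs = weights / weights.sum()
    idx = rng.choice(len(coords), size=n_cells, replace=False, p=probs)
    sites = coords[idx].copy()

    def assign(sites):
        # Label each voxel with the index of its nearest site.
        d2 = ((coords[:, None, :] - sites[None, :, :]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)

    for _ in range(n_iters):
        labels = assign(sites)
        wsum = np.bincount(labels, weights=weights, minlength=n_cells)
        nonempty = wsum > 0
        # Move each non-empty cell's site to its saliency-weighted centroid.
        for axis in range(3):
            num = np.bincount(labels, weights=weights * coords[:, axis],
                              minlength=n_cells)
            sites[nonempty, axis] = num[nonempty] / wsum[nonempty]

    return coords.astype(int), assign(sites), sites

def morton_key(xyz, bits=10):
    """Interleave the bits of integer (x, y, z) into one Z-order key."""
    key = 0
    for b in range(bits):
        for axis in range(3):
            key |= ((int(xyz[axis]) >> b) & 1) << (3 * b + axis)
    return key

def ordered_cells(sites):
    """Return cell indices sorted by the Morton key of each site (assumed ordering)."""
    keys = [morton_key(np.floor(s).astype(int)) for s in sites]
    return np.argsort(keys, kind="stable")

# Synthetic example: a 16^3 grid whose corner octant is marked as "complex".
grid = np.ones((16, 16, 16))
grid[:8, :8, :8] = 20.0
coords, labels, sites = saliency_cvt(grid, n_cells=32)
order = ordered_cells(sites)  # one token slot per supervoxel, in a fixed order
```

In this toy setup the high-saliency octant attracts a disproportionate share of the 32 sites, so its cells are smaller, while the rest of the grid is covered by a few large cells. The sequence length equals the number of cells rather than the number of voxels, which is the kind of compression the abstract describes.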
Problem

Research questions and friction points this paper is trying to address.

3D tokenization
autoregressive generation
spatial ordering
sequence redundancy
shape representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

supervoxel tokenization
adaptive 3D representation
autoregressive 3D generation
saliency-guided tessellation
ordered sequence modeling