SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

career value

203K/year

🤖 AI Summary

Existing 3D autoregressive generation methods struggle with excessively long sequences or ambiguous spatial ordering in high-resolution modeling due to the lack of efficient voxel tokenization strategies. This work proposes a geometry saliency-guided adaptive supervoxel partitioning approach that leverages saliency prediction and centroidal Voronoi tessellation to achieve shape-aware, deterministic spatial ordering. This strategy significantly compresses sequence length while preserving structural integrity. By integrating a supervoxel-based VAE with a fine-tuned multimodal large language model, the method reduces token sequences to just 12.8% of those from uniform voxelization on the Trellis-500K benchmark, achieving state-of-the-art generation quality and accelerating inference by an order of magnitude compared to prior approaches.

📝 Abstract

Autoregressive multimodal large language models (MLLMs) enable 3D generation but struggle to scale to high-resolution shapes due to inadequate 3D tokenizations. Compact set-based representations discard deterministic spatial ordering, leading to ambiguous sequence prediction, while uniform or octree-based voxel grids preserve ordering at the cost of severe redundancy and excessively long sequences. This structural trade-off limits stable and efficient autoregressive 3D generation. We present SuperVoxelGPT, a representation-first framework that resolves this tension through adaptive and deterministically ordered supervoxel tokenization. Given a prompt, we first predict a coarse geometric saliency distribution and construct a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, allocating fine-grained cells to complex regions and larger cells to smooth regions. Conditioned on the text and ordered supervoxel layout, we introduce a SuperVoxelVAE and fine-tune a pretrained MLLM to autoregressively generate supervoxel tokens. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10$\times$ speedup over prior methods.

Problem

Research questions and friction points this paper is trying to address.

3D tokenization

autoregressive generation

spatial ordering

sequence redundancy

shape representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

supervoxel tokenization

adaptive 3D representation

autoregressive 3D generation