SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

📅 2026-02-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes SceneTok, a tokenizer that encodes view sets of a 3D scene into a small set of unstructured tokens. Existing 3D scene representations typically rely on 3D data structures or view-aligned fields; SceneTok's tokens are instead permutation-invariant and disentangled from the spatial grid. A lightweight rectified flow decoder renders novel views directly from this latent space, including along trajectories that deviate from the input views. The method reaches state-of-the-art reconstruction quality while compressing 1-3 orders of magnitude more strongly than other representations. The compact token set also enables scene generation in 5 seconds, a better quality-speed trade-off than previous paradigms.
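To make "permutation-invariant tokens" concrete, here is a minimal sketch of one common way a decoder can read from an unordered token set: softmax attention with no positional encoding. The summary does not say that SceneTok uses this exact mechanism, so the `attend` function, the token shapes, and the random values are all hypothetical illustrations.

```python
import math
import random


def attend(query, tokens):
    """Single-query softmax attention over an unordered set of tokens.

    Each token is a (key, value) pair of equal-length float lists.
    No positional information is used, so token order cannot matter.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key, _ in tokens]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(tokens[0][1])
    return [
        sum(w * value[i] for w, (_, value) in zip(weights, tokens)) / total
        for i in range(dim)
    ]


rng = random.Random(0)

# Hypothetical scene tokens: 8 tokens with 4-dimensional keys and values.
tokens = [
    ([rng.gauss(0, 1) for _ in range(4)], [rng.gauss(0, 1) for _ in range(4)])
    for _ in range(8)
]
query = [rng.gauss(0, 1) for _ in range(4)]  # stand-in for a target-view query

shuffled = tokens[:]
rng.shuffle(shuffled)

out_a = attend(query, tokens)
out_b = attend(query, shuffled)

# The readout matches up to floating-point summation order.
max_diff = max(abs(a - b) for a, b in zip(out_a, out_b))
```

Because the weighted sum runs over the whole set, reordering the tokens only changes the order of floating-point additions, so `max_diff` stays at rounding-error scale.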

📝 Abstract

We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation-invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a light-weight rectified flow decoder. We show that the compression is 1-3 orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly-compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 5 seconds, achieving a much better quality-speed trade-off than previous paradigms.
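As a rough illustration of how a rectified flow decoder produces a sample, the sketch below Euler-integrates a velocity field from noise at t = 0 to a sample at t = 1. It assumes the usual rectified flow convention of straight-line paths between noise and data. In SceneTok the velocity would come from a learned network conditioned on the scene tokens and the target camera; here `toy_velocity`, `TARGET`, and `noise` are hypothetical stand-ins with a closed form.

```python
def sample(velocity, x0, num_steps):
    """Euler-integrate dx/dt = velocity(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical "clean" sample that the toy flow transports noise toward.
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x, t):
    """Closed-form velocity for straight-line paths toward a single target.

    On the path x_t = (1 - t) * noise + t * target, the velocity is
    target - noise, which equals (target - x_t) / (1 - t).
    """
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


noise = [1.3, 0.2, -0.7]  # stand-in for a Gaussian noise draw
result = sample(toy_velocity, noise, num_steps=4)

# Largest deviation of the integrated sample from the target.
max_err = max(abs(r - t) for r, t in zip(result, TARGET))
```

With a perfectly straight path, a handful of Euler steps already lands on the target; a learned velocity field is only approximately straight, so step count becomes a quality-speed knob.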

Problem

Research questions and friction points this paper is trying to address.

3D scene representation

scene compression

unstructured tokens

view synthesis

latent representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

SceneTok

compressed token representation

permutation-invariant tokens