Scaling Sequence-to-Sequence Generative Neural Rendering

📅 2025-10-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in neural rendering: reliance on explicit 3D representations and scarcity of camera-annotated 3D data. To this end, we propose Kaleido—the first pure-decoder Transformer model that formulates 3D view synthesis as a sequence-to-sequence image generation task. Our core innovations are threefold: (1) treating 3D as a special case of video to unify object- and scene-level view synthesis; (2) eliminating explicit geometric or radiance-field representations, instead directly modeling 6-DoF viewpoint transformations in pixel-sequence space via masked autoregressive modeling and rectified flow Transformers; and (3) leveraging large-scale unlabeled video pretraining to drastically reduce dependence on 3D supervision. Experiments demonstrate state-of-the-art performance across multiple benchmarks: Kaleido achieves superior zero-shot generalization over existing generative methods in few-view settings, and—uniquely among feedforward models—matches the quality of per-scene optimization approaches under many-view conditions.

📝 Abstract
We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido operates on the principle that 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task. Through a systematic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets -- all without any architectural modifications. Kaleido sets a new state-of-the-art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings, and, for the first time, matches the quality of per-scene optimisation methods in many-view settings.
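The rectified flow objective mentioned in the abstract can be illustrated in isolation. The sketch below is a minimal, hypothetical illustration (not the paper's implementation): rectified flow trains a model to regress the constant velocity of a straight-line path between noise and data, and sampling integrates that velocity field with a few Euler steps. The function names and the toy oracle velocity field are assumptions for illustration only.

```python
import numpy as np

def rectified_flow_pair(x0, x1, t):
    """Point on the straight path between noise x0 and data x1 at time t,
    plus the constant velocity target the model would regress to."""
    xt = (1.0 - t) * x0 + t * x1   # linear interpolation along the path
    v_target = x1 - x0             # rectified-flow velocity target
    return xt, v_target

def euler_sample(velocity_fn, x0, steps=8):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (sample)
    with simple fixed-step Euler updates."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x
```

As a sanity check, an oracle velocity field `lambda x, t: (x1 - x) / (1.0 - t)` steers any Euler trajectory exactly onto the data point `x1`; in practice the velocity field is the output of the decoder-only transformer.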
Problem

Research questions and friction points this paper is trying to address.

Performing generative view synthesis without explicit 3D representations
Generating multiple 6-DoF target views from reference views
Unifying 3D and video modeling within a single transformer framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative view synthesis without explicit 3D representations
Masked autoregressive framework for multi-view generation
Unified 3D and video modeling via rectified flow transformer
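The masked autoregressive idea above can be sketched as an attention-mask pattern. The following is a hypothetical sketch, not the paper's actual masking scheme: reference-view tokens attend to one another, while each masked set of target-view tokens attends to all references and to previously generated target sets, but never to later ones. The function name and set-wise ordering are assumptions for illustration.

```python
import numpy as np

def view_synthesis_mask(n_ref, n_tgt_sets, set_size):
    """Build a (tokens x tokens) attention mask (1 = may attend) for a
    decoder-only transformer generating target views set by set."""
    n = n_ref + n_tgt_sets * set_size
    mask = np.zeros((n, n), dtype=np.int8)
    mask[:n_ref, :n_ref] = 1                 # references see each other
    for s in range(n_tgt_sets):
        start = n_ref + s * set_size
        end = start + set_size
        mask[start:end, :n_ref] = 1          # each set sees all references
        mask[start:end, n_ref:end] = 1       # plus earlier sets and itself
    return mask
```

Because conditioning is expressed purely through the mask, the same layout handles any number of reference and target views, which is what lets a single sequence model cover both few-view and many-view regimes.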