🤖 AI Summary
Transformer-based novel view synthesis (NVS) under sparse-view settings suffers from poor generalization, over-reliance on limited real-world data, and severe artifacts in synthetic data. Method: The paper proposes (1) leveraging diffusion models to generate high-fidelity, diverse synthetic training data that mitigates domain shift; and (2) a token disentanglement mechanism that explicitly separates geometric and appearance representations in feature space, suppressing synthetic artifacts and enabling stable, scalable training of 3D-aware Transformers on large-scale synthetic data, the first approach of its kind. The approach integrates self-supervised learning with disentangled representation learning. Contribution/Results: The method significantly improves cross-domain generalization, achieves state-of-the-art results across multiple benchmarks, delivers superior reconstruction quality, and trains more efficiently at reduced computational cost.
📝 Abstract
Large transformer-based models have made significant progress in generalizable novel view synthesis (NVS) from sparse input views, generating novel viewpoints without the need for test-time optimization. However, these models are constrained by the limited diversity of publicly available scene datasets, making most real-world (in-the-wild) scenes out-of-distribution. To overcome this, we incorporate synthetic training data generated from diffusion models, which improves generalization across unseen domains. While synthetic data offers scalability, we identify artifacts introduced during data generation as a key bottleneck affecting reconstruction quality. To address this, we propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. This refinement not only improves reconstruction quality over standard transformers but also enables scalable training with synthetic data. As a result, our method outperforms existing models on both in-dataset and cross-dataset evaluations, achieving state-of-the-art results across multiple benchmarks while significantly reducing computational costs. Project page: https://scaling3dnvs.github.io/
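The abstract does not specify how the token disentanglement is implemented. As a purely illustrative sketch (the projection matrices `w_geo` and `w_app`, the subspace split, and all dimensions below are assumptions for illustration, not the paper's actual architecture), one simple way to separate per-token geometry and appearance features is with two learned linear heads:

```python
import numpy as np

rng = np.random.default_rng(0)

def disentangle_tokens(tokens, w_geo, w_app):
    """Split each transformer token into a geometry part and an
    appearance part via two separate linear projections.

    tokens: (n_tokens, d_model) token features from the transformer.
    w_geo:  (d_model, d_geo) projection into a geometry subspace.
    w_app:  (d_model, d_app) projection into an appearance subspace.
    """
    geo = tokens @ w_geo  # geometry-only features (e.g., shape/depth cues)
    app = tokens @ w_app  # appearance-only features (e.g., color/texture cues)
    return geo, app

# Toy dimensions (hypothetical, not from the paper).
n_tokens, d_model, d_geo, d_app = 16, 64, 32, 32
tokens = rng.standard_normal((n_tokens, d_model))
w_geo = rng.standard_normal((d_model, d_geo)) / np.sqrt(d_model)
w_app = rng.standard_normal((d_model, d_app)) / np.sqrt(d_model)

geo, app = disentangle_tokens(tokens, w_geo, w_app)
print(geo.shape, app.shape)  # (16, 32) (16, 32)
```

In practice the two projections would be trained end-to-end, so that synthetic-data artifacts that contaminate appearance features can be kept out of the geometry branch; the abstract's claim is that this separation is what makes large-scale synthetic training stable.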