True Self-Supervised Novel View Synthesis is Transferable

📅 2025-10-14
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
The core challenge in novel view synthesis (NVS) lies in whether models learn geometrically consistent camera pose representations. This paper introduces *transferability* as a fundamental criterion for NVS: latent poses extracted from one video must be directly reusable to render the same camera trajectory in unseen videos. To this end, we propose XFactor, the first fully self-supervised NVS framework that requires no SE(3) parameterization, multi-view geometric priors, or explicit 3D supervision. Its design centers on pairwise image-based pose estimation, an input-output augmentation scheme that disentangles camera motion from scene content, and a geometry-agnostic Transformer architecture. Probing experiments confirm a strong correlation between the learned latent poses and ground-truth camera trajectories, large-scale evaluation shows that XFactor significantly outperforms prior pose-free methods, and a newly proposed transferability metric empirically validates the cross-scene generalizability of its latent pose representations.
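The probing experiments mentioned above fit a simple readout from latent poses to ground-truth camera poses and check how much variance it explains. The toy sketch below illustrates this kind of linear probe on synthetic data; the variable names, dimensions, and the use of a plain least-squares probe are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Toy linear probe: regress ground-truth poses from latent poses and
# report R^2. A high score indicates the latents encode camera motion.
rng = np.random.default_rng(0)
n, d_latent, d_pose = 200, 8, 6            # frames, latent dim, pose dim (assumed)

W_true = rng.normal(size=(d_latent, d_pose))
Z = rng.normal(size=(n, d_latent))          # stand-in for learned latent poses
P = Z @ W_true + 0.01 * rng.normal(size=(n, d_pose))  # stand-in ground-truth poses

W, *_ = np.linalg.lstsq(Z, P, rcond=None)   # fit the linear probe
P_hat = Z @ W
r2 = 1.0 - ((P - P_hat) ** 2).sum() / ((P - P.mean(axis=0)) ** 2).sum()
print(round(r2, 3))  # near 1.0 when latents are linearly related to poses
```

In the paper's setting, Z would come from the trained pose encoder and P from real camera trajectories; a high R² is evidence that the unconstrained latents track physical camera motion.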

📝 Abstract
In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: whether a pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: the same set of poses leads to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme for the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry, such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers; probing experiments further show that its latent poses are highly correlated with real-world poses.
Problem

Research questions and friction points this paper is trying to address.

Achieving transferable pose representation across different video sequences
Disentangling camera pose from scene content without 3D biases
Quantifying transferability in self-supervised novel view synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-free self-supervised novel view synthesis model
Pair-wise pose estimation with input-output augmentation
Transferable latent pose variables without 3D inductive biases
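The transferability criterion above can be made concrete with a toy check: decode the same latent pose codes in two different scenes and compare the resulting camera trajectories. The sketch below is purely illustrative and is not the paper's metric; the decoder, scene conditioning, and gap measure are all assumptions.

```python
import numpy as np

# Toy transferability check: if pose decoding is scene-independent,
# the same latents yield the same trajectory in every scene.
rng = np.random.default_rng(1)
latents = rng.normal(size=(50, 8))          # shared latent pose codes

def decode_pose(z, scene_matrix):
    # stand-in for a pose decoder conditioned on a scene
    return z @ scene_matrix

# Transferable case: both scenes use the same scene-independent decoding.
A = rng.normal(size=(8, 6))
traj_1 = decode_pose(latents, A)
traj_2 = decode_pose(latents, A)

# Non-transferable case: a second scene reinterprets the latents.
B = rng.normal(size=(8, 6))
traj_3 = decode_pose(latents, B)

def trajectory_gap(t1, t2):
    # relative discrepancy between two trajectories
    return float(np.linalg.norm(t1 - t2) / np.linalg.norm(t1))

print(trajectory_gap(traj_1, traj_2))  # 0.0: poses transfer
print(trajectory_gap(traj_1, traj_3))  # large: poses do not transfer
```

A transferability metric in this spirit would report the trajectory gap averaged over many scene pairs, with low values indicating that latent poses carry scene-independent camera motion.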
Authors
Thomas W. Mitchel (PlayStation)
Hyunwoo Ryu (MIT, Artificial Intelligence / Robotics)
Vincent Sitzmann (MIT CSAIL)