RayZer: A Self-supervised Large View Synthesis Model

📅 2025-05-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses multi-view 3D perception and novel view synthesis without 3D supervision, i.e., without ground-truth camera poses or scene geometry annotations. The proposed RayZer is a fully self-supervised, transformer-based model that decouples camera parameter estimation from scene representation learning, using ray structure as its only 3D prior. Given unposed, uncalibrated images, it predicts camera parameters, reconstructs a latent scene representation, and renders novel views; because target views are rendered with the model's own predicted poses, training requires only 2D photometric supervision. Experiments show novel view synthesis quality comparable to, or better than, "oracle" methods that rely on ground-truth poses, demonstrating that geometric consistency and 3D awareness can emerge purely from self-supervision.

📝 Abstract
We present RayZer, a self-supervised multi-view 3D vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates novel view synthesis performance comparable or even superior to "oracle" methods that rely on pose annotations in both training and testing. Project: https://hwjiang1510.github.io/RayZer/
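The training scheme described in the abstract, where self-predicted poses are used to re-render held-out views so that only 2D images supervise the model, can be sketched with toy stand-ins. Everything below (shapes, the linear "encoders", the renderer) is a hypothetical simplification for illustration; the actual RayZer components are transformers over image tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch of unposed, uncalibrated input images (hypothetical sizes).
N, H, W = 4, 8, 8
images = rng.random((N, H, W, 3))

def predict_poses(imgs):
    """Stand-in pose head: maps each image to a 6-DoF pose vector."""
    feats = imgs.reshape(len(imgs), -1)
    Wp = rng.random((feats.shape[1], 6)) * 0.01
    return feats @ Wp                                  # (num_imgs, 6)

def encode_scene(imgs, poses):
    """Stand-in scene encoder: pools context images + poses into a latent."""
    feats = imgs.reshape(len(imgs), -1).mean(axis=0)
    return np.concatenate([feats, poses.mean(axis=0)])

def render(latent, pose, shape=(H, W, 3)):
    """Stand-in renderer: decodes a view from the latent and a query pose."""
    Wd = rng.random((latent.size + pose.size, np.prod(shape))) * 0.001
    x = np.concatenate([latent, pose])
    return (x @ Wd).reshape(shape)

# One self-supervised step: predict poses for ALL images, build the scene
# from a context subset, re-render the held-out targets with their
# *predicted* poses, and compare against the real target images in 2D.
poses = predict_poses(images)
ctx, tgt = [0, 1], [2, 3]
latent = encode_scene(images[ctx], poses[ctx])
loss = np.mean([(render(latent, poses[i]) - images[i]) ** 2 for i in tgt])
print(float(loss))
```

The point of the sketch is the data flow: no ground-truth pose ever enters the loss, so gradients through the photometric error must improve both the pose head and the scene representation jointly.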
Problem

Research questions and friction points this paper is trying to address.

Novel view synthesis pipelines typically require ground-truth camera poses or calibration
Recovering camera parameters from unposed, uncalibrated image sets is difficult
It was unclear whether 3D awareness can emerge from 2D image supervision alone
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully self-supervised multi-view 3D model trained without pose or geometry annotations
Self-predicted camera poses drive target-view rendering, so only 2D images supervise training
Transformer-based model whose sole 3D prior is the ray structure linking camera, pixel, and scene
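A common way to realize "ray structure as the only 3D prior" in ray-based transformers is to embed each pixel as a 6-D Plücker ray (direction plus moment). Whether RayZer uses exactly this parameterization is an assumption here; the sketch below only illustrates how camera, pixel, and scene get connected through per-pixel ray tokens.

```python
import numpy as np

def pixel_rays(K, c2w, H, W):
    """Per-pixel Plücker ray embeddings: (direction, moment = o x d).

    K is a 3x3 intrinsics matrix, c2w a 4x4 camera-to-world transform.
    Returns an (H, W, 6) array: one 6-D ray token per pixel.
    """
    # Pixel centers in homogeneous image coordinates.
    i, j = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([i, j, np.ones_like(i)], axis=-1)          # (H, W, 3)
    # Unproject to camera frame, rotate to world frame, normalize.
    dirs = (pix @ np.linalg.inv(K).T) @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Plücker moment encodes the ray origin without storing it directly.
    origin = c2w[:3, 3]
    moment = np.cross(np.broadcast_to(origin, dirs.shape), dirs)
    return np.concatenate([dirs, moment], axis=-1)            # (H, W, 6)

# Hypothetical 32x32 camera with focal length 50 at the world origin.
K = np.array([[50.0, 0.0, 16.0], [0.0, 50.0, 16.0], [0.0, 0.0, 1.0]])
rays = pixel_rays(K, np.eye(4), 32, 32)
print(rays.shape)
```

The Plücker form is convenient because it is invariant to the choice of point along the ray, so predicted cameras and pixels share one representation the scene transformer can attend over.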