WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of novel view synthesis in dynamic scenes, where concurrent camera and object motion violates multi-view consistency, leading to ghosting artifacts, geometric distortions, and unstable pose estimation in conventional methods. To overcome this, we propose a self-supervised framework that leverages a static renderer to extract residuals, which are used to identify transient regions and generate pseudo motion masks. These masks guide the model to prioritize static background reconstruction. Our approach is the first to enable large-scale dynamic scene synthesis without manual annotations, integrating residual-driven transient-aware mechanisms, motion estimation distillation, input token masking, and gradient gating. Evaluated on our newly curated real-world dynamic datasets, D-RE10K and D-RE10K-iPhone, our method achieves state-of-the-art performance in both transient region removal and overall synthesis quality using a single feed-forward pass, outperforming existing optimization-based and feed-forward baselines.

📝 Abstract
We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
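The residual-driven masking and gradient gating described above can be sketched in a few lines. This is an illustrative toy implementation, not the paper's actual pipeline: the residual threshold `tau`, the L2 photometric error, and the hard binary mask are all simplifying assumptions made here for clarity.

```python
import numpy as np

def pseudo_motion_mask(static_render, target, tau=0.1):
    """Per-pixel residual between the camera-only static render and the
    observed frame; large residuals flag likely transient (moving) content.
    (Simple thresholding assumed here; the paper's criterion may differ.)"""
    residual = np.abs(static_render - target).mean(axis=-1)  # H x W
    return residual > tau  # True where content is likely transient

def gated_photometric_loss(pred, target, motion_mask):
    """Gate the reconstruction loss so supervision flows only through
    static-background pixels, in the spirit of the paper's gradient gating."""
    static = ~motion_mask
    err = ((pred - target) ** 2).mean(axis=-1)  # per-pixel L2 error
    return (err * static).sum() / max(static.sum(), 1)

# Toy example: a small moving object is present in the target frame
# but absent from the camera-only static render.
H, W = 8, 8
target = np.zeros((H, W, 3))
static_render = np.zeros((H, W, 3))
target[3:5, 3:5] = 1.0  # transient object occupies a 2x2 patch

mask = pseudo_motion_mask(static_render, target)
loss = gated_photometric_loss(static_render, target, mask)
```

Because the static render matches the target everywhere outside the masked patch, the gated loss here is zero: the transient object no longer penalizes the static reconstruction, which is exactly the failure mode (ghosting from inconsistent pixels) that the masking is meant to prevent.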
Problem

Research questions and friction points this paper is trying to address.

novel view synthesis
dynamic environments
multi-view consistency
transient regions
camera motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised
novel view synthesis
dynamic environments
motion masking
analysis-by-synthesis