WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of novel view synthesis in dynamic scenes, where concurrent camera and object motion violates multi-view consistency, leading to ghosting artifacts, geometric distortions, and unstable pose estimation in conventional methods. To overcome this, we propose a self-supervised framework that leverages a static renderer to extract residuals, which are used to identify transient regions and generate pseudo motion masks. These masks guide the model to prioritize static background reconstruction. Our approach is the first to enable large-scale dynamic scene synthesis without manual annotations, integrating residual-driven transient-aware mechanisms, motion estimation distillation, input token masking, and gradient gating. Evaluated on our newly curated real-world dynamic datasets, D-RE10K and D-RE10K-iPhone, our method achieves state-of-the-art performance in both transient region removal and overall synthesis quality using a single feed-forward pass, outperforming existing optimization-based and feed-forward baselines.

📝 Abstract
We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
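The residual-driven masking and gradient gating described above can be sketched in a few lines. This is an illustrative toy implementation, not the paper's actual pipeline: the residual threshold `tau`, the L2 photometric error, and the hard binary mask are all simplifying assumptions made here for clarity.

```python
import numpy as np

def pseudo_motion_mask(static_render, target, tau=0.1):
    """Per-pixel residual between the camera-only static render and the
    observed frame; large residuals flag likely transient (moving) content.
    (Simple thresholding assumed here; the paper's criterion may differ.)"""
    residual = np.abs(static_render - target).mean(axis=-1)  # H x W
    return residual > tau  # True where content is likely transient

def gated_photometric_loss(pred, target, motion_mask):
    """Gate the reconstruction loss so supervision flows only through
    static-background pixels, in the spirit of the paper's gradient gating."""
    static = ~motion_mask
    err = ((pred - target) ** 2).mean(axis=-1)  # per-pixel L2 error
    return (err * static).sum() / max(static.sum(), 1)

# Toy example: a small moving object is present in the target frame
# but absent from the camera-only static render.
H, W = 8, 8
target = np.zeros((H, W, 3))
static_render = np.zeros((H, W, 3))
target[3:5, 3:5] = 1.0  # transient object occupies a 2x2 patch

mask = pseudo_motion_mask(static_render, target)
loss = gated_photometric_loss(static_render, target, mask)
```

Because the static render matches the target everywhere outside the masked patch, the gated loss here is zero: the transient object no longer penalizes the static reconstruction, which is exactly the failure mode (ghosting from inconsistent pixels) that the masking is meant to prevent.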
Problem

Research questions and friction points this paper is trying to address.

novel view synthesis
dynamic environments
multi-view consistency
transient regions
camera motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised
novel view synthesis
dynamic environments
motion masking
analysis-by-synthesis