WildCAT3D: Appearance-Aware Multi-View Diffusion in the Wild

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-scene generalizable, scene-level novel view synthesis (NVS) remains difficult: clean multi-view training data are scarce, and manually curated datasets suffer from limited diversity, little camera variation, and licensing restrictions, while abundant in-the-wild imagery (e.g., tourist photos) exhibits illumination changes, transient occlusions, and unknown camera parameters. To address this, the paper proposes a multi-view diffusion model that explicitly conditions on a global appearance representation, unlocking end-to-end training on this heterogeneous wild data without requiring clean multi-view correspondences. Evaluated on both object-level and scene-level single-image NVS benchmarks, the method achieves state-of-the-art performance while training on strictly fewer data sources than prior methods. Moreover, it enables controllable global appearance editing during inference—e.g., lighting or color modulation—without retraining.
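The key idea—conditioning every generated view on a single shared global appearance code alongside per-view camera information—can be illustrated with a toy sketch. Everything below (the function name, the linear "denoiser", the embedding sizes) is hypothetical and not from the paper; it only shows how one appearance vector, broadcast across all views, separates "what the scene looks like" from "where each camera is", so that swapping the code at inference edits appearance globally.

```python
import numpy as np

def toy_denoise_step(view_latents, camera_embeds, appearance_code, weights):
    """Hypothetical appearance-conditioned denoising step (illustration only).

    view_latents:    (n_views, d) noisy per-view latents
    camera_embeds:   (n_views, d) per-view camera/pose conditioning
    appearance_code: (d,) one global appearance vector shared by all views
    weights:         (d, d) stand-in for the learned denoiser
    """
    # Broadcast the single appearance code to every view: all generated
    # views share illumination/appearance, while cameras differ per view.
    cond = camera_embeds + appearance_code[None, :]
    return (view_latents + cond) @ weights

rng = np.random.default_rng(0)
d, n_views = 8, 3
latents = rng.normal(size=(n_views, d))
cameras = rng.normal(size=(n_views, d))
weights = np.eye(d)

sunny = np.ones(d)        # one appearance condition...
overcast = -np.ones(d)    # ...and an edited one, swapped in at inference

out_a = toy_denoise_step(latents, cameras, sunny, weights)
out_b = toy_denoise_step(latents, cameras, overcast, weights)

# Same geometry conditioning, different global appearance: the outputs
# differ by the same shift in every view, i.e. the edit is global.
diff = out_a - out_b
print(np.allclose(diff, diff[0]))  # True
```

In the real model the appearance code would be a learned embedding injected into a diffusion network rather than an additive shift; the sketch only demonstrates the conditioning structure described in the summary.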

📝 Abstract
Despite recent advances in sparse novel view synthesis (NVS) applied to object-centric scenes, scene-level NVS remains a challenge. A central issue is the lack of available clean multi-view training data, beyond manually curated datasets with limited diversity, camera variation, or licensing issues. On the other hand, an abundance of diverse and permissively-licensed data exists in the wild, consisting of scenes with varying appearances (illuminations, transient occlusions, etc.) from sources such as tourist photos. To this end, we present WildCAT3D, a framework for generating novel views of scenes learned from diverse 2D scene image data captured in the wild. We unlock training on these data sources by explicitly modeling global appearance conditions in images, extending the state-of-the-art multi-view diffusion paradigm to learn from scene views of varying appearances. Our trained model generalizes to new scenes at inference time, enabling the generation of multiple consistent novel views. WildCAT3D provides state-of-the-art results on single-view NVS in object- and scene-level settings, while training on strictly fewer data sources than prior methods. Additionally, it enables novel applications by providing global appearance control during generation.
Problem

Research questions and friction points this paper is trying to address.

Scene-level NVS lacks clean multi-view training data beyond curated datasets with limited diversity, camera variation, or licensing issues
Abundant in-the-wild imagery varies in illumination and contains transient occlusions, making it hard to train on directly
Sparse-input NVS methods offer no control over global appearance during generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends multi-view diffusion to train on diverse 2D scene images captured in the wild
Explicitly models global appearance conditions (illumination, transient occlusions) in each view
Enables global appearance control during novel view generation at inference time