🤖 AI Summary
Existing approaches to novel view synthesis under occlusion typically rely on multi-stage pipelines that first complete the image and then synthesize views. These pipelines neglect cross-view dependencies and incur redundant computation and storage. This paper introduces EscherNet++, an end-to-end diffusion model that unifies zero-shot novel view synthesis with amodal completion of occluded regions. Input-level and feature-level masked fine-tuning lets a single model handle both tasks, improving cross-view consistency. Because the model can synthesize arbitrary query views, it can also be paired with existing feed-forward image-to-mesh models without additional training, reducing reconstruction time by 95%. In the 10-input-view setting, the approach improves PSNR by 3.9 and volume IoU by 0.28 on occluded tasks, and it generalizes to real-world occluded reconstruction.
📝 Abstract
We propose EscherNet++, a masked fine-tuned diffusion model that synthesizes novel views of objects in a zero-shot manner and supports amodal completion. Existing approaches rely on complex multi-stage pipelines that first hallucinate the missing parts of the image and then perform novel view synthesis. Such pipelines fail to consider cross-view dependencies and require redundant storage and computation for the separate stages. Instead, we apply masked fine-tuning, with both input-level and feature-level masking, to obtain an end-to-end model with improved ability to synthesize novel views and perform amodal completion. In addition, we integrate our model with other feed-forward image-to-mesh models without extra training and achieve competitive results while reducing reconstruction time by 95%, thanks to its ability to synthesize arbitrary query views. The scalability of our method further supports fast 3D reconstruction. Despite being fine-tuned on a smaller dataset with a smaller batch size, our method achieves state-of-the-art results, improving PSNR by 3.9 and Volume IoU by 0.28 on occluded tasks in the 10-input-view setting, while also generalizing to real-world occluded reconstruction.
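For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how PSNR and Volume IoU are commonly computed. It is not the authors' evaluation code. The function names `psnr` and `volume_iou`, the flat pixel-list input, the `[0, 1]` intensity range, and the set-of-voxel-indices representation are all assumptions made for illustration.

```python
import math
from typing import Iterable, Set, Tuple

# Hypothetical type alias: an occupied voxel is identified by its grid index.
Voxel = Tuple[int, int, int]


def psnr(pred: Iterable[float], target: Iterable[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images given as flat pixel sequences.

    Assumes intensities lie in [0, max_val]; higher is better.
    """
    pred, target = list(pred), list(target)
    if not pred or len(pred) != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def volume_iou(occupied_a: Set[Voxel], occupied_b: Set[Voxel]) -> float:
    """Intersection over union of two voxelized shapes given as sets of occupied voxels."""
    union = occupied_a | occupied_b
    if not union:
        return 1.0  # two empty volumes are treated as identical (a convention, assumed here)
    return len(occupied_a & occupied_b) / len(union)
```

Under these definitions, a PSNR gain of 3.9 corresponds to the mean squared error shrinking by a factor of roughly 10^0.39 ≈ 2.5, and a Volume IoU gain of 0.28 is an absolute increase on a 0 to 1 scale.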