EscherNet++: Simultaneous Amodal Completion and Scalable View Synthesis through Masked Fine-Tuning and Enhanced Feed-Forward 3D Reconstruction

📅 2025-07-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to amodal completion and novel view synthesis typically rely on multi-stage pipelines that first hallucinate the occluded parts of an image and then synthesize new views. This neglects cross-view dependencies and duplicates storage and computation across stages. This paper introduces EscherNet++, an end-to-end diffusion model that performs zero-shot novel view synthesis and amodal completion (recovering the occluded regions of an object) together. Masked fine-tuning at both the input level and the feature level lets a single model handle both tasks. Because the model can synthesize arbitrary query views, it can be combined with existing feed-forward image-to-mesh models without additional training, cutting reconstruction time by 95% while keeping results competitive. In the 10-input-view setting, the method improves PSNR by 3.9 and Volume IoU by 0.28 on occluded tasks, despite being fine-tuned on a smaller dataset and batch size, and it also generalizes to real-world occluded reconstruction.

📝 Abstract

We propose EscherNet++, a masked fine-tuned diffusion model that can synthesize novel views of objects in a zero-shot manner with amodal completion ability. Existing approaches utilize multiple stages and complex pipelines to first hallucinate missing parts of the image and then perform novel view synthesis, which fail to consider cross-view dependencies and require redundant storage and computing for separate stages. Instead, we apply masked fine-tuning including input-level and feature-level masking to enable an end-to-end model with the improved ability to synthesize novel views and conduct amodal completion. In addition, we empirically integrate our model with other feed-forward image-to-mesh models without extra training and achieve competitive results with reconstruction time decreased by 95%, thanks to its ability to synthesize arbitrary query views. Our method's scalable nature further enhances fast 3D reconstruction. Despite fine-tuning on a smaller dataset and batch size, our method achieves state-of-the-art results, improving PSNR by 3.9 and Volume IoU by 0.28 on occluded tasks in 10-input settings, while also generalizing to real-world occluded reconstruction.
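The abstract reports its gains in PSNR (image quality of synthesized views) and Volume IoU (overlap between reconstructed and ground-truth shapes). As a reminder of what those two numbers measure, here is a minimal sketch using their standard definitions. The function names, the [0, 1] image range, and the boolean voxel-grid representation are assumptions for illustration; the paper's actual evaluation code and voxelization settings are not given on this page.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    Assumes pixel values lie in [0, max_val].
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)


def volume_iou(pred_occ, gt_occ):
    """Intersection over union of two boolean occupancy grids of equal shape."""
    pred_occ = np.asarray(pred_occ, dtype=bool)
    gt_occ = np.asarray(gt_occ, dtype=bool)
    union = np.logical_or(pred_occ, gt_occ).sum()
    if union == 0:
        return 1.0  # both grids empty: treat as a perfect match
    return np.logical_and(pred_occ, gt_occ).sum() / union
```

Higher is better for both metrics. Volume IoU is bounded by 1, so a 0.28 improvement is a large absolute change on that scale.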

Problem

Research questions and friction points this paper is trying to address.

Simultaneous amodal completion and novel view synthesis

Efficient 3D reconstruction with reduced computation time

Improved performance on occluded object reconstruction tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked fine-tuning for end-to-end view synthesis

Enhanced feed-forward 3D reconstruction integration

Scalable novel view synthesis with amodal completion
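To make the distinction between input-level and feature-level masking concrete, the following is a rough sketch of what the two operations could look like on a single image and its feature grid. Everything here is an assumption for illustration: the rectangular occluder, the fill value, the 0.5 drop threshold, and the helper names `random_box_mask`, `mask_input`, and `mask_features` are hypothetical and are not taken from the paper, which may generate and apply its masks quite differently.

```python
import numpy as np


def random_box_mask(height, width, rng, max_frac=0.5):
    """Return a (height, width) bool array, True inside one random box."""
    box_h = int(rng.integers(1, max(2, int(height * max_frac) + 1)))
    box_w = int(rng.integers(1, max(2, int(width * max_frac) + 1)))
    top = int(rng.integers(0, height - box_h + 1))
    left = int(rng.integers(0, width - box_w + 1))
    mask = np.zeros((height, width), dtype=bool)
    mask[top:top + box_h, left:left + box_w] = True
    return mask


def mask_input(image, mask, fill=1.0):
    """Input-level masking: overwrite occluded pixels of an (H, W, C) image."""
    out = image.copy()
    out[mask] = fill  # hypothetical choice: paint the occluder a flat color
    return out


def mask_features(features, mask, threshold=0.5):
    """Feature-level masking: zero tokens of an (h, w, D) feature grid.

    A token is zeroed when more than `threshold` of the pixels in its
    patch are occluded (hypothetical rule).
    """
    h, w, _ = features.shape
    ph, pw = mask.shape[0] // h, mask.shape[1] // w
    # Fraction of occluded pixels inside each token's patch.
    frac = mask[:h * ph, :w * pw].reshape(h, ph, w, pw).mean(axis=(1, 3))
    out = features.copy()
    out[frac > threshold] = 0.0
    return out
```

In a fine-tuning loop, a mask of this kind would be sampled per training view, with the unmasked target views supervising the output so the model learns to complete hidden regions while synthesizing new views.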

🔎 Similar Papers

Generalizable 3D Scene Reconstruction via Divide and Conquer from a Single View