🤖 AI Summary
Existing 3D object removal methods rely on an initial 3D reconstruction and multi-view geometry priors, and they model appearance inconsistently across views. To address these limitations, this paper proposes VEIGAR, a reconstruction-free inpainting framework that aligns priors explicitly. Methodologically: (1) it introduces an explicit prior alignment mechanism in pixel space to enforce cross-view consistency; (2) it designs a scale-invariant depth loss that avoids the scale-and-shift calibration normally required when regularizing with monocular depth estimates; and (3) it uses a lightweight foundation model to perform the alignment, combined with multi-view consistency supervision. Experiments show that the approach achieves state-of-the-art reconstruction quality and cross-view consistency. It also reduces training time threefold compared with the fastest existing method, lowering computational overhead.
📝 Abstract
Recent advances in Novel View Synthesis (NVS) and 3D generation have significantly improved editing tasks, with a primary emphasis on maintaining cross-view consistency throughout the generative process. Contemporary methods typically address this challenge with a two-part strategy: (1) performing consistent 2D inpainting across all views, guided by priors embedded either explicitly in pixel space or implicitly in latent space; and (2) conducting 3D reconstruction with additional consistency guidance. Previous strategies, in particular, often require an initial 3D reconstruction phase to establish geometric structure, which introduces considerable computational overhead. Even with this added cost, the resulting reconstruction quality often remains suboptimal. In this paper, we present VEIGAR, a computationally efficient framework that outperforms existing methods without relying on an initial reconstruction phase. VEIGAR leverages a lightweight foundation model to reliably align priors explicitly in pixel space. In addition, we introduce a novel supervision strategy based on a scale-invariant depth loss, which removes the need for traditional scale-and-shift operations in monocular depth regularization. In extensive experiments, VEIGAR achieves state-of-the-art reconstruction quality and cross-view consistency while reducing training time threefold compared to the fastest existing method, demonstrating a favorable balance of efficiency and effectiveness.
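The abstract does not state the formula of the scale-invariant depth loss. As an illustration of the general idea only, the sketch below implements the widely used log-space scale-invariant error (in the style of Eigen et al.), which is an assumption and may differ from VEIGAR's actual formulation. The function name, the `mask`, `lam`, and `eps` parameters, and the flat-list input format are all hypothetical choices made for this example. This form cancels a global scale factor on the predicted depth; it does not by itself cancel an additive shift.

```python
import math
from typing import Optional, Sequence


def scale_invariant_depth_loss(
    pred: Sequence[float],
    target: Sequence[float],
    mask: Optional[Sequence[bool]] = None,
    lam: float = 1.0,
    eps: float = 1e-6,
) -> float:
    """Log-space scale-invariant depth error (illustrative, not VEIGAR's code).

    With d_i = log(pred_i) - log(target_i), returns
        mean(d_i^2) - lam * mean(d_i)^2.
    For lam = 1 this is the variance of the log-ratio, so multiplying
    `pred` by any positive constant leaves the loss unchanged.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    diffs = []
    for i, (p, t) in enumerate(zip(pred, target)):
        # Skip pixels excluded by the (hypothetical) validity mask.
        if mask is not None and not mask[i]:
            continue
        # Clamp to eps so the logarithm is defined for zero depths.
        diffs.append(math.log(max(p, eps)) - math.log(max(t, eps)))

    n = len(diffs)
    if n == 0:
        return 0.0

    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    return mean_sq - lam * mean * mean
```

Because a global scale `c` on the prediction adds the same constant `log(c)` to every `d_i`, the subtracted mean term removes it, which is why no explicit scale alignment between the monocular depth estimate and the rendered depth is needed in this formulation.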