🤖 AI Summary
Active view selection for novel-view synthesis and 3D reconstruction remains challenging due to reliance on task-specific 3D representations and high computational overhead. Method: We propose a generic, efficient 2D-centric solution that reformulates 3D uncertainty modeling as a cross-reference image quality assessment task. Specifically, we predict SSIM scores for candidate views and select the one with the lowest predicted rendering quality as the next best view. For the first time, we adapt the CrossScore framework to this setting, integrating no-reference IQA models (e.g., MUSIQ, MANIQA) under an unsupervised cross-reference learning paradigm. Contribution/Results: Our approach is representation-agnostic—requiring no explicit 3D geometry or scene representation—thereby significantly improving generalizability and efficiency. On standard benchmarks, it achieves superior quantitative and qualitative performance over prior methods, while accelerating inference by 14–33× compared to state-of-the-art approaches.
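The selection rule described above can be sketched in a few lines. This is an illustrative sketch only: `predict_ssim` stands in for the learned cross-reference IQA model, and its name and signature are assumptions, not taken from the paper's code.

```python
# Hypothetical sketch of the next-best-view rule: score each candidate
# view with a learned SSIM predictor and pick the lowest-quality one.
# `predict_ssim` is a stand-in for the cross-reference IQA model.

def select_next_best_view(candidate_views, predict_ssim):
    """Return the candidate whose predicted rendering quality is lowest."""
    scores = {view: predict_ssim(view) for view in candidate_views}
    return min(scores, key=scores.get)

# Toy usage with made-up scores: view "b" has the lowest predicted SSIM,
# so it is selected as the next view to capture.
fake_scores = {"a": 0.91, "b": 0.62, "c": 0.85}
print(select_next_best_view(fake_scores, fake_scores.get))  # → b
```

Because the scoring model operates purely on 2D renderings and reference captures, this loop is independent of whether the underlying scene is a NeRF, a Gaussian-splatting model, or any other 3D representation.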
📝 Abstract
We tackle active view selection in novel view synthesis and 3D reconstruction. Existing methods like FisherRF and ActiveNeRF select the next best view by minimizing uncertainty or maximizing information gain in 3D, but they require specialized designs for different 3D representations and involve complex modelling in 3D space. Instead, we reframe this as a 2D image quality assessment (IQA) task, selecting views where current renderings have the lowest quality. Since ground-truth images for candidate views are unavailable, full-reference metrics like PSNR and SSIM are inapplicable, while no-reference metrics, such as MUSIQ and MANIQA, lack the essential multi-view context. Inspired by CrossScore, a recent cross-reference quality assessment framework, we train a model to predict SSIM within a multi-view setup and use it to guide view selection. Our cross-reference IQA framework achieves substantial quantitative and qualitative improvements across standard benchmarks, while being agnostic to 3D representations, and runs 14–33× faster than previous methods.
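To make the full-reference point concrete: SSIM is defined over a pair of images, so it cannot score a candidate view whose ground-truth capture has not been taken yet. The toy implementation below is a simplified global (non-windowed) variant of SSIM in pure Python, not the windowed version used in practice, just to show that both images enter the formula.

```python
# Toy, global (non-windowed) SSIM: the formula needs BOTH images x and y,
# which is exactly why full-reference metrics cannot score unseen views.

def global_ssim(x, y, data_range=1.0):
    """Global SSIM over two equal-length intensity sequences in [0, data_range]."""
    c1 = (0.01 * data_range) ** 2  # stabilizing constants from the SSIM definition
    c2 = (0.03 * data_range) ** 2
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n                            # means
    vx = sum((a - mx) ** 2 for a in x) / n                     # variances
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n   # covariance
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

# Identical images score 1.0; a perturbed copy scores strictly lower.
ref = [0.1, 0.5, 0.9, 0.3]
print(global_ssim(ref, ref))                        # → 1.0
print(global_ssim(ref, [0.2, 0.4, 0.8, 0.4]) < 1)   # → True
```

The proposed cross-reference model sidesteps this limitation by predicting the SSIM a candidate view would receive, using other captured views of the scene as context instead of a per-pixel ground-truth reference.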