🤖 AI Summary
Current NeRF image quality assessment lacks a single metric that is robust across datasets: NeRF-specific artifacts cause mainstream metrics to correlate poorly with human perception. To address this, we propose the first multi-metric evaluation framework integrating DISTS (based on deep feature similarity) and VMAF (based on multi-scale visual fidelity), systematically exploring normalization strategies and linear/nonlinear fusion methods to establish an end-to-end assessment pipeline. Our key contribution is the novel synergistic integration of DISTS and VMAF for NeRF quality evaluation. Extensive validation on the Synthetic and Outdoor datasets demonstrates that all three fusion configurations significantly outperform either individual metric, with an average SROCC improvement of 0.12. Moreover, the fused metrics exhibit stronger cross-dataset generalizability and closer agreement with subjective scores (measured by SROCC and PCC), establishing a more reliable and transferable benchmark for NeRF evaluation.
📝 Abstract
Neural Radiance Fields (NeRFs) have demonstrated significant potential in synthesizing novel viewpoints. Evaluating NeRF-generated outputs, however, remains challenging due to the unique artifacts they exhibit, and no individual metric performs well across all datasets. We hypothesize that combining two successful metrics grounded in different perceptual approaches, Deep Image Structure and Texture Similarity (DISTS) and Video Multi-Method Assessment Fusion (VMAF), can overcome the limitations of each individual metric and achieve improved correlation with subjective quality scores. We experiment with two normalization strategies for the individual metrics and two fusion strategies, evaluating their impact on the resulting correlation with subjective scores. The proposed pipeline is tested on two distinct datasets, Synthetic and Outdoor, and its performance is evaluated across three configurations. We present a detailed analysis comparing the correlation coefficients of the fused metrics and the individual scores against subjective scores, demonstrating the robustness and generalizability of the fusion approach.
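The pipeline described above (normalize each metric, fuse, then correlate with subjective scores) can be sketched minimally as follows. This is an illustrative reconstruction, not the authors' code: the min-max normalization, the equal-weight linear fusion, and all per-scene score values are assumptions made for the example. Note that DISTS is a distance (lower is better) while VMAF is a quality score (higher is better), so DISTS must be inverted before fusion.

```python
import numpy as np
from scipy.stats import spearmanr


def min_max_normalize(scores):
    """Scale scores to [0, 1]; one common normalization choice."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)


def fuse_linear(dists_scores, vmaf_scores, w=0.5):
    """Linear fusion of normalized DISTS and VMAF.

    DISTS is inverted after normalization so that, like VMAF,
    higher fused values mean better quality. The weight w is a
    hypothetical equal-weight choice, not from the paper.
    """
    d = 1.0 - min_max_normalize(dists_scores)  # invert: distance -> quality
    v = min_max_normalize(vmaf_scores)
    return w * d + (1.0 - w) * v


# Toy per-scene scores and subjective MOS values (made up for illustration).
dists = [0.12, 0.30, 0.25, 0.08, 0.40]
vmaf = [85.0, 52.0, 60.0, 91.0, 40.0]
mos = [4.5, 2.8, 3.2, 4.8, 2.1]

fused = fuse_linear(dists, vmaf)
srocc, _ = spearmanr(fused, mos)  # rank correlation with subjective scores
```

A nonlinear fusion (e.g. a learned regressor over the two normalized scores) slots into the same pipeline by replacing `fuse_linear`; the evaluation step via SROCC/PCC is unchanged.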