🤖 AI Summary
Existing NVS evaluation metrics inadequately balance perceptual realism and geometric fidelity under viewpoint transformation, exhibiting low correlation with human preferences. To address this, we propose PRISM—a task-aware evaluation framework for novel view synthesis. PRISM extracts semantic features from Zero123 and enhances discriminative capability via lightweight fine-tuning. We introduce two complementary metrics: D_PRISM (reference-dependent), quantifying local structural consistency, and MMD_PRISM (reference-free), measuring global distribution alignment. Evaluated on Toys4K, GSO, and OmniObject3D, MMD_PRISM achieves robust model ranking, where lower scores consistently correlate with superior NVS performance. Crucially, PRISM significantly improves agreement with human judgments—achieving an average 18.7% increase in Spearman’s ρ—while offering reliability, generality, and interpretability. This work establishes a principled, human-aligned evaluation paradigm for NVS.
📝 Abstract
The goal of Novel View Synthesis (NVS) is to generate realistic images of given content from unseen viewpoints. But how can we trust that a generated image truly reflects the intended transformation? Evaluating its reliability remains a major challenge. While recent generative models, particularly diffusion-based approaches, have significantly improved NVS quality, existing evaluation metrics struggle to assess whether a generated image is both realistic and faithful to the source view and the intended viewpoint transformation. Standard metrics, such as pixel-wise similarity and distribution-based measures, often mis-rank incorrect results because they fail to capture the nuanced relationship between the source image, the viewpoint change, and the generated output. We propose a task-aware evaluation framework that leverages features from a strong NVS foundation model, Zero123, combined with a lightweight tuning step to enhance discrimination. Using these features, we introduce two complementary evaluation metrics: a reference-based score, $D_{\text{PRISM}}$, and a reference-free score, $\text{MMD}_{\text{PRISM}}$. Both reliably identify incorrect generations and rank models in agreement with human preference studies, addressing a fundamental gap in NVS evaluation. Our framework provides a principled and practical approach to assessing synthesis quality, paving the way for more reliable progress in novel view synthesis. To further support this goal, we apply our reference-free metric to six NVS methods across three benchmarks: Toys4K, Google Scanned Objects (GSO), and OmniObject3D, where $\text{MMD}_{\text{PRISM}}$ produces a clear and stable ranking, with lower scores consistently indicating stronger models.
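To make the reference-free idea concrete: $\text{MMD}_{\text{PRISM}}$ measures the maximum mean discrepancy between the feature distributions of generated and real views. The paper computes this over (fine-tuned) Zero123 features; as a generic illustration only, the sketch below estimates MMD² between two feature sets with an RBF kernel. The kernel choice and bandwidth here are our assumptions for demonstration, not details taken from the paper.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq_dists)

def mmd2(X, Y, gamma=1.0):
    """Biased (V-statistic) MMD^2 estimate between feature sets X and Y.

    X, Y: (n, d) and (m, d) arrays of image features
    (e.g., features extracted from generated vs. real views).
    Lower values indicate better distribution alignment.
    """
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())

# Illustrative usage with synthetic "features":
rng = np.random.default_rng(0)
feats_real = rng.normal(size=(50, 8))
feats_shifted = feats_real + 5.0          # a clearly mismatched distribution
print(mmd2(feats_real, feats_real))        # ~0: identical distributions
print(mmd2(feats_real, feats_shifted))     # larger: distributions differ
```

The key property exploited for model ranking is monotonicity: a model whose generated-view features lie closer to the real-view feature distribution receives a lower score.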