AI Summary
Existing image quality assessment metrics are constrained by closed vocabularies and rigid parametric assumptions, limiting their ability to accurately evaluate high-quality generated images. This work proposes the APEX framework, which introduces, for the first time, the parameter-free, assumption-free sliced Wasserstein distance into image quality evaluation. By leveraging dual foundation models, CLIP and DINOv2, APEX extracts open-vocabulary features and measures distributional similarity through projected analysis in a high-dimensional embedding space. This approach overcomes the expressivity and generalization limitations of conventional metrics, demonstrating exceptional stability and robustness across diverse visual degradation scenarios, cross-dataset evaluations, and out-of-domain data.
Abstract
As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX scales effectively to high-dimensional spaces, as we show with theoretical and empirical evidence. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, yielding consistent evaluations on out-of-domain datasets.
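To make the projection-based idea concrete, the following is a minimal sketch of a Monte Carlo estimate of the Sliced Wasserstein Distance between two sets of feature vectors. It is an illustration under stated assumptions, not the APEX implementation. The function name `sliced_wasserstein_distance`, the `n_projections` and `p` parameters, and the requirement of equal sample counts are choices made for this example. The random arrays stand in for CLIP or DINOv2 embeddings, which are not computed here.

```python
import numpy as np


def sliced_wasserstein_distance(x, y, n_projections=128, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance.

    x, y: arrays of shape (n, d) holding n feature vectors of dimension d.
    Assumption of this sketch: both sets have the same shape.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.ndim != 2 or x.shape != y.shape:
        raise ValueError("this sketch expects two (n, d) arrays of equal shape")

    rng = np.random.default_rng(seed)
    # Draw random directions uniformly on the unit sphere in R^d.
    theta = rng.standard_normal((x.shape[1], n_projections))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)

    # Project both sets onto every direction and sort each projection.
    # In one dimension, optimal transport pairs up the order statistics,
    # so no Gaussian (or other parametric) assumption is needed.
    x_proj = np.sort(x @ theta, axis=0)
    y_proj = np.sort(y @ theta, axis=0)

    # Average |difference|^p over samples and directions, then take the p-th root.
    return float(np.mean(np.abs(x_proj - y_proj) ** p) ** (1.0 / p))


# Hypothetical stand-ins for embeddings of real and generated images.
rng = np.random.default_rng(42)
real_features = rng.standard_normal((500, 64))
generated_features = rng.standard_normal((500, 64)) + 0.5
print(sliced_wasserstein_distance(real_features, generated_features))
```

Each random direction reduces the comparison to a one-dimensional transport problem that sorting solves exactly, so the cost grows roughly linearly with the number of projections and the feature dimension. This is one way to see why a sliced formulation can remain practical in high-dimensional embedding spaces.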