🤖 AI Summary
Existing full-reference image quality assessment (FR-IQA) methods rely heavily on large-scale human annotations, incur high computational overhead, and are difficult to deploy for latent-space optimization. To address these challenges, this paper proposes MILO, a lightweight, multi-scale perceptual quality metric. MILO uses a VAE encoder to extract latent representations and combines multi-scale feature fusion, spatial masking, and curriculum learning, trained with pseudo-MOS supervision to eliminate the need for manual labeling. Its core innovation is the first incorporation of spatial masking and curriculum learning into latent-space quality modeling, which significantly improves perceptual alignment and optimization efficiency. Experiments show that MILO surpasses state-of-the-art FR-IQA methods on major benchmarks, achieves real-time inference, and delivers substantial performance gains at lower computational cost in denoising, super-resolution, and face restoration tasks.
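The pseudo-MOS supervision described above (apply reproducible distortions, then label each distorted image with an ensemble quality score instead of a human rating) can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the single MSE-based scorer stands in for the ensemble of learned FR-IQA metrics, and the noise distortion stands in for the paper's distortion bank.

```python
import numpy as np

def add_noise(img: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Reproducible synthetic distortion: additive Gaussian noise, clipped to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

# Hypothetical ensemble: each scorer maps (reference, distorted) -> quality in [0, 1].
# A real setup would use several learned FR-IQA metrics; MSE is a toy stand-in.
def mse_score(ref: np.ndarray, dist: np.ndarray) -> float:
    return 1.0 - float(np.mean((ref - dist) ** 2))

ENSEMBLE = [mse_score]

def pseudo_mos(ref: np.ndarray, dist: np.ndarray) -> float:
    """Average the ensemble's scores to form a pseudo-MOS training label."""
    return float(np.mean([m(ref, dist) for m in ENSEMBLE]))

rng = np.random.default_rng(0)
ref = rng.random((32, 32))           # toy grayscale "reference" image
samples = []
for level in (0.02, 0.08, 0.2):      # reproducible distortion levels
    dist = add_noise(ref, level, rng)
    samples.append((dist, pseudo_mos(ref, dist)))
# Stronger distortions receive lower pseudo-MOS labels, so the metric
# can be trained on (distorted image, label) pairs without human raters.
```

The (image, pseudo-MOS) pairs collected in `samples` are what a metric like MILO would regress against; all names here (`add_noise`, `pseudo_mos`, `ENSEMBLE`) are illustrative, not from the paper.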
📝 Abstract
We present MILO (Metric for Image- and Latent-space Optimization), a lightweight, multiscale, perceptual metric for full-reference image quality assessment (FR-IQA). MILO is trained using pseudo-MOS (Mean Opinion Score) supervision, in which reproducible distortions are applied to diverse images and scored by an ensemble of recent quality metrics that account for visual masking effects. This approach enables accurate learning without large-scale human-labeled datasets. Despite its compact architecture, MILO outperforms existing metrics across standard FR-IQA benchmarks and offers inference fast enough for real-time applications. Beyond quality prediction, we demonstrate the utility of MILO as a perceptual loss in both the image and latent domains. In particular, we show that the spatial masking modeled by MILO, when applied to latent representations from a VAE encoder within Stable Diffusion, enables efficient and perceptually aligned optimization. By combining spatial masking with a curriculum learning strategy, we first process perceptually less relevant regions before progressively shifting the optimization toward more visually distorted areas. This strategy yields significantly improved performance in tasks such as denoising, super-resolution, and face restoration, while also reducing computational overhead. MILO thus functions both as a state-of-the-art image quality metric and as a practical tool for perceptual optimization in generative pipelines.
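The masking-plus-curriculum idea in the abstract (begin the optimization on perceptually less relevant regions, then progressively admit the most distorted areas) can be illustrated with a toy schedule. This is a hedged sketch under simplifying assumptions: the "perceptual relevance" of a region is approximated here by its per-pixel squared error, whereas MILO derives its masks from learned latent-space features; `curriculum_mask` and `masked_loss` are illustrative names, not the paper's API.

```python
import numpy as np

def curriculum_mask(error_map: np.ndarray, step: int, total_steps: int) -> np.ndarray:
    """Binary mask that starts on low-error (perceptually less relevant) regions
    and grows to cover the most distorted areas as the schedule advances."""
    frac = (step + 1) / total_steps           # fraction of pixels admitted so far
    threshold = np.quantile(error_map, frac)  # admit the easiest `frac` of pixels
    return (error_map <= threshold).astype(np.float32)

def masked_loss(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    """Mean squared error restricted to the currently unmasked region."""
    return float(np.sum(mask * (pred - target) ** 2) / max(mask.sum(), 1.0))

rng = np.random.default_rng(1)
target = rng.random((16, 16))                    # toy "clean" latent/image
pred = target + rng.normal(0.0, 0.1, target.shape)  # distorted estimate
err = (pred - target) ** 2

m_early = curriculum_mask(err, 0, 4)  # early step: ~25% lowest-error pixels
m_late = curriculum_mask(err, 3, 4)   # final step: all pixels included
# The early loss is computed only on easy regions, so it is no larger
# than the full-coverage loss seen at the end of the curriculum.
```

In an actual restoration loop, `masked_loss` would be backpropagated through the VAE latent at each step while the mask widens, which is one plausible reading of how the curriculum reduces wasted computation on already-acceptable regions.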