Cross-Modal Scene Semantic Alignment for Image Complexity Assessment

📅 2025-10-21
🤖 AI Summary
Existing image complexity assessment (ICA) methods predominantly rely on unimodal visual features, limiting their ability to capture human subjective perception and scene-level semantic diversity. To address this, we propose the first cross-modal ICA framework that explicitly incorporates scene semantics via a dual-branch vision–text architecture: (1) a complexity regression branch for direct prediction, and (2) a scene semantic alignment branch trained on image–text pairs, leveraging pretrained vision–language models to guide complexity estimation with cross-modal semantic priors. Crucially, the semantic alignment mechanism enhances the model’s capacity to represent perception-relevant semantics. Extensive experiments on multiple public ICA benchmarks demonstrate significant improvements over state-of-the-art methods, validating that cross-modal semantic alignment substantially enhances both consistency and robustness in complexity assessment.

📝 Abstract
Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity of real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network features from a single visual modality, which are insufficient to capture the perceptual representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, cross-modal scene semantic information remains unexplored in the context of ICA. In this paper, we therefore propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal perspective to enhance ICA performance and make complexity predictions more consistent with subjective human perception. Specifically, CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, while the scene semantic alignment branch uses pair-wise learning to align images with corresponding text prompts that convey rich scene semantic information. Extensive experiments on several ICA datasets demonstrate that the proposed CM-SSA significantly outperforms state-of-the-art approaches. Code is available at https://github.com/XQ2K/First-Cross-Model-ICA.
Problem

Research questions and friction points this paper is trying to address.

Assessing image complexity using cross-modal scene semantic alignment
Improving image complexity predictions to match human perception
Leveraging text prompts to enhance visual complexity understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages cross-modal scene semantic alignment
Aligns images with text prompts pair-wise
Guides complexity regression via semantic alignment
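
The listing does not state the loss functions used by CM-SSA, so the following is only a minimal sketch of how a regression objective and a pair-wise image-text alignment objective might be combined. It assumes a mean-squared-error regression loss and a symmetric contrastive (CLIP-style) alignment loss; the function names, the `temperature` value, and the `weight` parameter are hypothetical and are not taken from the paper or its repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def similarity_logits(img_embs, txt_embs, temperature=0.07):
    """Cosine similarity between every image and every text, divided by a temperature."""
    imgs = [l2_normalize(v) for v in img_embs]
    txts = [l2_normalize(v) for v in txt_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def alignment_loss(img_embs, txt_embs, temperature=0.07):
    """Assumed symmetric contrastive loss: image i should match text prompt i."""
    logits = similarity_logits(img_embs, txt_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def regression_loss(predicted, target):
    """Assumed mean squared error between predicted and annotated complexity scores."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def total_loss(predicted, target, img_embs, txt_embs, weight=1.0):
    """Hypothetical joint objective: regression term plus weighted alignment term."""
    return regression_loss(predicted, target) + weight * alignment_loss(img_embs, txt_embs)
```

In this sketch the alignment term is small when each image embedding is closest to its own text prompt and large when the pairs are mismatched, which is one plausible way for scene semantics to guide the regression branch during joint training.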
Authors
Yuqing Luo
School of Computer Science, Cardiff University, United Kingdom
Yixiao Li
Georgia Institute of Technology
Machine Learning
Jiang Liu
School of Computer Science, Cardiff University, United Kingdom
Jun Fu
School of Computer Science, Cardiff University, United Kingdom
Hadi Amirpour
University of Klagenfurt
Video Compression, Quality of Experience, Video Streaming, Medical Image Processing
Guanghui Yue
School of Biomedical Engineering, Shenzhen University, China
Baoquan Zhao
Sun Yat-sen University
3D point cloud processing and compression, Multimedia content analysis, Open Educational Resources
Padraig Corcoran
Cardiff University
Network Science, Operations Research, FinTech
Hantao Liu
Full Professor of Computer Science, Cardiff University
Artificial Intelligence, Image and Video Processing, Applied Perception, Medical Imaging
Wei Zhou
School of Computer Science, Cardiff University, United Kingdom