DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction

📅 2025-11-21
🤖 AI Summary
Visual complexity prediction bridges computer vision and human perception, yet whether linguistic information is necessary for the task remains unclear. This paper introduces DReX, a purely visual model that, without any textual modality, aligns with human perception by fusing semantic representations from DINOv3 (ViT-S/16) with multi-scale convolutional features from ResNet-50 via a learnable attention mechanism. With roughly 1/21.5 the parameters of state-of-the-art multimodal approaches, DReX achieves a Pearson correlation coefficient of 0.9581 on the IC9600 benchmark, outperforming prior methods on RMSE, MAE, and Spearman rank correlation while demonstrating strong cross-dataset generalization. Its core contributions are: (i) empirical evidence that language is unnecessary for accurate visual complexity prediction; and (ii) an efficient, lightweight, vision-only modeling paradigm that sets a new standard for parameter efficiency and perceptual fidelity.

📝 Abstract
Visual complexity prediction is a fundamental problem in computer vision with applications in image compression, retrieval, and classification. Understanding what makes humans perceive an image as complex is also a long-standing question in cognitive science. Recent approaches have leveraged multimodal models that combine visual and linguistic representations, but it remains unclear whether language information is necessary for this task. We propose DReX (DINO-ResNet Fusion), a vision-only model that fuses self-supervised and convolutional representations through a learnable attention mechanism to predict image complexity. Our architecture integrates multi-scale hierarchical features from ResNet-50 with semantically rich representations from DINOv3 ViT-S/16, enabling the model to capture both low-level texture patterns and high-level semantic structure. DReX achieves state-of-the-art performance on the IC9600 benchmark (Pearson r = 0.9581), surpassing previous methods, including those trained on multimodal image-text data, while using approximately 21.5x fewer learnable parameters. Furthermore, DReX generalizes robustly across multiple datasets and metrics, achieving superior results on Pearson and Spearman correlation, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). Ablation and attention analyses confirm that DReX leverages complementary cues from both backbones, with the DINOv3 [CLS] token enhancing sensitivity to visual complexity. Our findings suggest that visual features alone can be sufficient for human-aligned complexity prediction and that, when properly fused, self-supervised transformers and supervised deep convolutional neural networks offer complementary and synergistic benefits for this task.
Problem

Research questions and friction points this paper is trying to address.

Predicting image complexity using vision-only fusion of representations
Determining whether language information is necessary for complexity assessment
Integrating self-supervised and convolutional features for human-aligned prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses self-supervised and convolutional representations via attention
Integrates multi-scale ResNet features with DINOv3 ViT semantics
Achieves state-of-the-art performance using vision-only architecture
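The paper's code is not reproduced here; as a rough illustration of the fusion idea described above, the following NumPy sketch attention-weights a DINOv3 [CLS] token against globally pooled ResNet-50 features and regresses a scalar complexity score. The feature dimensions (384 for ViT-S, 2048 for ResNet-50), the single-vector pooling (the paper uses multi-scale hierarchical features), and all weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

# Assumed dimensions: DINOv3 ViT-S/16 [CLS] token (384-d),
# ResNet-50 pooled features (2048-d), shared fusion space (256-d).
D_DINO, D_RESNET, D_FUSE = 384, 2048, 256

# Random stand-ins for learnable projection and head weights.
W_dino = rng.standard_normal((D_DINO, D_FUSE)) * 0.02
W_resnet = rng.standard_normal((D_RESNET, D_FUSE)) * 0.02
w_attn = rng.standard_normal(D_FUSE) * 0.02   # scores each branch
w_head = rng.standard_normal(D_FUSE) * 0.02   # regression head

def predict_complexity(cls_token, resnet_feat):
    """Project both branches, weight them by learned attention, regress."""
    branches = np.stack([cls_token @ W_dino,
                         resnet_feat @ W_resnet])  # shape (2, D_FUSE)
    alpha = softmax(branches @ w_attn)             # one weight per branch
    fused = alpha @ branches                       # attention-weighted sum
    return float(fused @ w_head)                   # scalar complexity score

score = predict_complexity(rng.standard_normal(D_DINO),
                           rng.standard_normal(D_RESNET))
```

With trained weights, `alpha` would indicate how much the model relies on semantic (DINOv3) versus texture-level (ResNet) cues for a given image, which is the kind of signal the paper's attention analysis examines.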