Scale-Interaction Transformer: A Hybrid CNN-Transformer Model for Facial Beauty Prediction

📅 2025-09-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Facial beauty prediction (FBP) suffers from insufficient modeling of interactions between multi-scale local and global features. To address this, we propose a CNN-Transformer hybrid architecture: the front end employs parallel multi-scale convolutions to extract fine-grained local features, while the back end serializes features from all scales and feeds them into a Transformer encoder, introducing for the first time an explicit inter-scale interaction mechanism to model cross-granularity feature dependencies. This design synergistically leverages CNNs' strong local inductive bias and Transformers' capacity for long-range dependency modeling. Evaluated on the SCUT-FBP5500 dataset, our model achieves a state-of-the-art Pearson correlation coefficient of 0.9187, demonstrating the effectiveness of explicit multi-scale interaction modeling for complex image regression tasks.

📝 Abstract
Automated Facial Beauty Prediction (FBP) is a challenging computer vision task due to the complex interplay of local and global facial features that influence human perception. While Convolutional Neural Networks (CNNs) excel at feature extraction, they often process information at a fixed scale, potentially overlooking the critical inter-dependencies between features at different levels of granularity. To address this limitation, we introduce the Scale-Interaction Transformer (SIT), a novel hybrid deep learning architecture that synergizes the feature extraction power of CNNs with the relational modeling capabilities of Transformers. The SIT first employs a multi-scale module with parallel convolutions to capture facial characteristics at varying receptive fields. These multi-scale representations are then framed as a sequence and processed by a Transformer encoder, which explicitly models their interactions and contextual relationships via a self-attention mechanism. We conduct extensive experiments on the widely-used SCUT-FBP5500 benchmark dataset, where the proposed SIT model establishes a new state-of-the-art. It achieves a Pearson Correlation of 0.9187, outperforming previous methods. Our findings demonstrate that explicitly modeling the interplay between multi-scale visual cues is crucial for high-performance FBP. The success of the SIT architecture highlights the potential of hybrid CNN-Transformer models for complex image regression tasks that demand a holistic, context-aware understanding.
Problem

Research questions and friction points this paper is trying to address.

Modeling complex interplay between local and global facial features
Overcoming fixed-scale processing limitations in CNN feature extraction
Capturing multi-scale feature interdependencies for beauty prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid CNN-Transformer model for multi-scale feature integration
Self-attention mechanism to model cross-scale feature relationships
Parallel convolutions capture facial features at varying scales
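The core idea above, extracting one feature embedding per scale, stacking the embeddings as a token sequence, and letting self-attention model their cross-scale interactions, can be sketched in a few lines. This is a minimal NumPy illustration of the general mechanism, not the paper's implementation; the branch projections stand in for CNN branches with different receptive fields, and all names and dimensions are our own assumptions.

```python
import numpy as np

# Hypothetical sketch of the scale-interaction idea (not the paper's code):
# 1) parallel multi-scale extractors each produce one embedding ("scale token"),
# 2) the tokens are stacked into a sequence,
# 3) scaled dot-product self-attention lets every scale attend to every other.

rng = np.random.default_rng(0)
d = 8  # embedding dimension per scale token (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the scale tokens."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n_scales, n_scales) interactions
    return softmax(scores) @ V               # each token mixes all scales

# Stand-ins for CNN branches with different receptive fields (e.g. 3x3/5x5/7x7):
# here each branch is just a random projection of a shared image feature.
image_feat = rng.normal(size=16)
branches = [rng.normal(size=(16, d)) for _ in range(3)]       # one per scale
scale_tokens = np.stack([image_feat @ W for W in branches])   # (3, d) sequence

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
interacted = self_attention(scale_tokens, Wq, Wk, Wv)         # (3, d)

# A regression head would pool the interacted tokens into a single beauty score.
score = float(interacted.mean())
print(interacted.shape)
```

In the actual model, a Transformer encoder (with layer norm, feed-forward blocks, and multiple heads) plays the role of `self_attention` here; the sketch only shows why serializing per-scale features lets attention weights express explicit inter-scale dependencies.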