Mamba-CNN: A Hybrid Architecture for Efficient and Accurate Facial Beauty Prediction

📅 2025-09-01
🤖 AI Summary
Facial attractiveness assessment faces a trade-off: CNNs are efficient but have limited receptive fields, while Vision Transformers (ViTs) capture global context at quadratic cost, O(N²) in the number of tokens N. This paper proposes Mamba-CNN, a hybrid CNN-SSM architecture. It pairs a lightweight hierarchical convolutional backbone for local feature extraction with a Mamba-inspired selective state space model (SSM) gating mechanism that captures long-range spatial dependencies efficiently via selective scanning. The design aims to mirror human holistic perception while keeping computation low, and is described as the first integration of CNNs and selective SSMs for fine-grained aesthetic prediction. On the SCUT-FBP5500 benchmark, the method reports state-of-the-art results: a Pearson correlation coefficient of 0.9187, a mean absolute error (MAE) of 0.2022, and a root mean square error (RMSE) of 0.2610.

📝 Abstract
The computational assessment of facial attractiveness, a challenging subjective regression task, is dominated by architectures with a critical trade-off: Convolutional Neural Networks (CNNs) offer efficiency but have limited receptive fields, while Vision Transformers (ViTs) model global context at a quadratic computational cost. To address this, we propose Mamba-CNN, a novel and efficient hybrid architecture. Mamba-CNN integrates a lightweight, Mamba-inspired State Space Model (SSM) gating mechanism into a hierarchical convolutional backbone. This core innovation allows the network to dynamically modulate feature maps and selectively emphasize salient facial features and their long-range spatial relationships, mirroring human holistic perception while maintaining computational efficiency. We conducted extensive experiments on the widely used SCUT-FBP5500 benchmark, where our model sets a new state of the art. Mamba-CNN achieves a Pearson Correlation (PC) of 0.9187, a Mean Absolute Error (MAE) of 0.2022, and a Root Mean Square Error (RMSE) of 0.2610. Our findings validate the synergistic potential of combining CNNs with selective SSMs and present a powerful new architectural paradigm for nuanced visual understanding tasks.
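The three reported metrics follow their standard definitions, and a minimal sketch may help make them concrete. The function names and the sample ratings below are illustrative assumptions, not the paper's evaluation code or data.

```python
import math

def pearson(y_true, y_pred):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error: like MAE, but penalizes large errors more."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Hypothetical ground-truth ratings and model predictions (not from the paper).
ratings = [1.0, 2.0, 3.0, 4.0]
predictions = [1.5, 2.0, 2.5, 4.0]

print(round(pearson(ratings, predictions), 4))
print(round(mae(ratings, predictions), 4))
print(round(rmse(ratings, predictions), 4))
```

Higher is better for the Pearson correlation, while lower is better for MAE and RMSE, which is why the three are usually reported together.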
Problem

Research questions and friction points this paper is trying to address.

Balancing efficiency and accuracy in facial attractiveness prediction models
Integrating global context modeling without quadratic computational cost
Dynamic modulation of facial features and long-range spatial relationships
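The quadratic-cost concern can be illustrated with a rough count. The sketch below compares how the number of pairwise token interactions in full self-attention grows against the number of sequential updates in a linear-time scan. The grid sizes are hypothetical, and the counts deliberately ignore channel width and constant factors, so they show scaling only, not real FLOPs.

```python
def attention_pairs(num_tokens):
    """Pairwise token interactions in full self-attention: N * N."""
    return num_tokens * num_tokens

def scan_steps(num_tokens):
    """Sequential state updates in a linear-time scan: N."""
    return num_tokens

# Hypothetical square feature-map sizes, flattened into N = side * side tokens.
for side in (14, 28, 56):
    n = side * side
    print(f"{side}x{side}: N={n}, attention={attention_pairs(n)}, scan={scan_steps(n)}")
```

Doubling the side length multiplies N by 4, the scan count by 4, and the attention count by 16.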
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Mamba-CNN architecture for facial beauty prediction
Integrates lightweight Mamba SSM gating into CNN backbone
Dynamically modulates features and captures long-range relationships
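To make the gating idea more tangible, here is a toy sketch of a scalar selective recurrence whose output gates the features it scanned. It is not the paper's block: the step-size rule, the `w_delta` and `b_delta` parameters, the sigmoid gate, and the sample values are all assumptions chosen for brevity, and a real Mamba-style layer uses multi-dimensional states and learned projections.

```python
import math

def softplus(v):
    """Smooth positive function, used here to keep the step size above zero."""
    return math.log1p(math.exp(v))

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def selective_scan(xs, w_delta=1.0, b_delta=0.0):
    """Scalar recurrence whose step size depends on the input at each position."""
    h = 0.0
    states = []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)  # input-dependent step size
        a = math.exp(-delta)                     # decay applied to the previous state
        h = a * h + (1.0 - a) * x                # blend the old state with the new input
        states.append(h)
    return states

def ssm_gate(features):
    """Scale each feature by a gate in (0, 1) computed from the scanned context."""
    context = selective_scan(features)
    return [f * sigmoid(c) for f, c in zip(features, context)]

# Hypothetical values from one flattened channel of a CNN feature map.
features = [0.2, 1.5, -0.3, 0.8]
gated = ssm_gate(features)
print(gated)
```

Because the state `h` is carried from position to position, each gate can depend on inputs seen much earlier in the sequence while the whole pass stays linear in its length.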