🤖 AI Summary
To address the trade-off between limited receptive fields in CNNs and high computational complexity (O(N²)) in Vision Transformers (ViTs) for facial attractiveness assessment, this paper proposes a CNN-SSM hybrid architecture. It employs a lightweight hierarchical convolutional backbone for local feature extraction and integrates a Mamba-inspired selective state space model (SSM) with gated mechanisms to efficiently capture long-range spatial dependencies via selective scanning. This design balances fidelity to human holistic perception with computational efficiency, marking the first integration of CNNs and selective SSMs for fine-grained aesthetic prediction. Evaluated on the SCUT-FBP5500 benchmark, the method achieves state-of-the-art performance: Pearson correlation coefficient of 0.9187, mean absolute error (MAE) of 0.2022, and root mean square error (RMSE) of 0.2610.
📝 Abstract
The computational assessment of facial attractiveness, a challenging subjective regression task, is dominated by architectures with a critical trade-off: Convolutional Neural Networks (CNNs) offer efficiency but have limited receptive fields, while Vision Transformers (ViTs) model global context at a quadratic computational cost. To address this, we propose Mamba-CNN, a novel and efficient hybrid architecture. Mamba-CNN integrates a lightweight, Mamba-inspired State Space Model (SSM) gating mechanism into a hierarchical convolutional backbone. This core innovation allows the network to dynamically modulate feature maps and selectively emphasize salient facial features and their long-range spatial relationships, mirroring human holistic perception while maintaining computational efficiency. We conducted extensive experiments on the widely used SCUT-FBP5500 benchmark, where our model sets a new state of the art. Mamba-CNN achieves a Pearson Correlation (PC) of 0.9187, a Mean Absolute Error (MAE) of 0.2022, and a Root Mean Square Error (RMSE) of 0.2610. Our findings validate the synergistic potential of combining CNNs with selective SSMs and present a powerful new architectural paradigm for nuanced visual understanding tasks.
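To make the gated selective-SSM idea concrete, here is a minimal NumPy sketch, not the paper's implementation: all projection shapes, the state size `N`, and the sigmoid gating branch are assumptions. It flattens a CNN feature map into a token sequence, runs an input-dependent (selective) state-space scan over it, and gates the scanned output before reshaping back to the spatial grid.

```python
import numpy as np

def selective_scan(x, delta, A, B, C):
    """Input-dependent SSM recurrence over a token sequence.

    x:     (L, D) flattened spatial features
    delta: (L, D) positive, input-dependent step sizes (the "selection")
    A:     (D, N) state-transition parameters (negative -> stable decay)
    B, C:  (L, N) input-dependent input/output projections
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))            # hidden state per channel
    ys = np.empty((L, D))
    for t in range(L):
        dA = np.exp(delta[t][:, None] * A)        # discretized transition (D, N)
        dB = delta[t][:, None] * B[t][None, :]    # discretized input map  (D, N)
        h = dA * h + dB * x[t][:, None]           # state update
        ys[t] = h @ C[t]                          # read out state -> (D,)
    return ys

def gated_ssm_block(feat, rng):
    """Apply a selective scan plus sigmoid gate to a (H, W, D) feature map.

    Random projections stand in for learned weights (illustrative only).
    """
    H, W, D = feat.shape
    N = 4                                          # assumed small state size
    x = feat.reshape(H * W, D)                     # flatten spatial grid to tokens
    # softplus keeps step sizes positive, as in Mamba-style discretization
    delta = np.log1p(np.exp(x @ (rng.standard_normal((D, D)) * 0.1)))
    A = -np.exp(rng.standard_normal((D, N)))       # negative real part -> decay
    B = x @ (rng.standard_normal((D, N)) * 0.1)    # input-dependent projections
    C = x @ (rng.standard_normal((D, N)) * 0.1)
    y = selective_scan(x, delta, A, B, C)
    gate = 1.0 / (1.0 + np.exp(-(x @ (rng.standard_normal((D, D)) * 0.1))))
    return (y * gate).reshape(H, W, D)             # back to spatial layout
```

Because `delta`, `B`, and `C` depend on the input tokens, the recurrence can selectively retain or forget state per position, which is what lets a single linear-time scan model long-range spatial dependencies without quadratic attention.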