VM-BeautyNet: A Synergistic Ensemble of Vision Transformer and Mamba for Facial Beauty Prediction

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Facial Beauty Prediction (FBP) requires modeling subjective human aesthetic perception, yet conventional CNN-based approaches struggle to simultaneously achieve comprehensive global feature representation and computational efficiency. To address this, we propose a heterogeneous dual-backbone architecture integrating Vision Transformer (ViT) and Mamba: ViT captures holistic facial structural patterns, while Mamba—leveraging linear-time complexity—efficiently models long-range spatial dependencies. Feature-level fusion bridges their complementary strengths, and Grad-CAM is incorporated to enhance model interpretability. Evaluated on the SCUT-FBP5500 benchmark, our method achieves state-of-the-art performance with a Pearson correlation coefficient of 0.9212, MAE of 0.2085, and RMSE of 0.2698—significantly outperforming single-backbone baselines. This demonstrates that the ViT-Mamba heterogeneity effectively balances prediction accuracy, computational efficiency, and interpretability in aesthetic modeling.
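The feature-level fusion described above, where ViT and Mamba features are combined before a regression head, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two backbones are stand-ins returning toy feature vectors, and the head is a plain linear regressor with made-up dimensions and weights.

```python
import random

random.seed(0)

DIM_VIT, DIM_MAMBA = 8, 8  # toy feature sizes; the paper does not specify real dims here

def vit_features(image):
    # Stand-in for a ViT backbone's pooled global features (hypothetical placeholder)
    return [random.random() for _ in range(DIM_VIT)]

def mamba_features(image):
    # Stand-in for a Mamba/SSM backbone's sequence features (hypothetical placeholder)
    return [random.random() for _ in range(DIM_MAMBA)]

def fuse_and_predict(image, weights, bias):
    # Feature-level fusion: concatenate the two feature vectors,
    # then apply a single linear regression head to predict a beauty score
    fused = vit_features(image) + mamba_features(image)
    return sum(w * f for w, f in zip(weights, fused)) + bias

weights = [0.1] * (DIM_VIT + DIM_MAMBA)
score = fuse_and_predict(None, weights, 1.0)
```

The key design point is that fusion happens at the feature level (concatenation before the head), so each backbone can specialize: global structure from ViT, long-range sequential features from Mamba.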

📝 Abstract
Facial Beauty Prediction (FBP) is a complex and challenging computer vision task, aiming to model the subjective and intricate nature of human aesthetic perception. While deep learning models, particularly Convolutional Neural Networks (CNNs), have made significant strides, they often struggle to capture the global, holistic facial features that are critical to human judgment. Vision Transformers (ViT) address this by effectively modeling long-range spatial relationships, but their quadratic complexity can be a bottleneck. This paper introduces a novel, heterogeneous ensemble architecture, VM-BeautyNet, that synergistically fuses the complementary strengths of a Vision Transformer and a Mamba-based vision model, a recent advancement in State-Space Models (SSMs). The ViT backbone excels at capturing global facial structure and symmetry, while the Mamba backbone efficiently models long-range dependencies with linear complexity, focusing on sequential features and textures. We evaluate our approach on the benchmark SCUT-FBP5500 dataset. Our proposed VM-BeautyNet achieves state-of-the-art performance, with a Pearson Correlation (PC) of 0.9212, a Mean Absolute Error (MAE) of 0.2085, and a Root Mean Square Error (RMSE) of 0.2698. Furthermore, through Grad-CAM visualizations, we provide interpretability analysis that confirms the complementary feature extraction of the two backbones, offering new insights into the model's decision-making process and presenting a powerful new architectural paradigm for computational aesthetics.
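The three reported metrics, Pearson Correlation (PC), MAE, and RMSE, have standard definitions that are easy to reproduce. Below is a stdlib-only sketch with toy score vectors (the data is illustrative, not from SCUT-FBP5500):

```python
import math

def pearson(y_true, y_pred):
    # PC: covariance of the two score vectors over the product of their std devs
    n = len(y_true)
    mx, my = sum(y_true) / n, sum(y_pred) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(y_true, y_pred))
    sx = math.sqrt(sum((a - mx) ** 2 for a in y_true))
    sy = math.sqrt(sum((b - my) ** 2 for b in y_pred))
    return cov / (sx * sy)

def mae(y_true, y_pred):
    # Mean Absolute Error: average magnitude of prediction errors
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root Mean Square Error: penalizes large errors more heavily than MAE
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

# Toy ground-truth vs. predicted beauty scores on a 1-5 scale (illustrative only)
scores = [2.1, 3.4, 4.0, 1.8, 3.0]
preds = [2.3, 3.2, 4.1, 2.0, 2.9]
```

A PC near 1 with low MAE/RMSE, as VM-BeautyNet reports (0.9212 / 0.2085 / 0.2698), indicates predictions that both track the ranking of human ratings and stay numerically close to them.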
Problem

Research questions and friction points this paper is trying to address.

Predicting facial beauty using global and sequential features
Overcoming computational complexity in holistic facial analysis
Ensembling Vision Transformer and Mamba for aesthetic assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines Vision Transformer and Mamba model ensemble
ViT captures global facial structure and symmetry
Mamba models long-range dependencies with linear complexity
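The linear-complexity claim in the last point comes from the recurrent form of state-space models: each token updates a fixed-size hidden state in constant time, so a length-L sequence costs O(L) rather than the O(L²) of self-attention. A minimal scalar sketch (real Mamba uses learned, input-dependent parameters and a hardware-aware parallel scan, none of which is shown here):

```python
def ssm_scan(inputs, a=0.9, b=0.5, c=1.0):
    # Minimal 1-D state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
    # Each step does constant work, so the whole scan is linear in sequence
    # length -- the property Mamba-style SSMs exploit. Parameters a, b, c are
    # arbitrary illustrative constants, not learned values.
    h, outputs = 0.0, []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs

# Impulse response: a single 1.0 followed by zeros decays geometrically
ys = ssm_scan([1.0, 0.0, 0.0, 0.0])
```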