SynergyNet: Fusing Generative Priors and State-Space Models for Facial Beauty Prediction

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of jointly modeling local facial details and global structural patterns in facial beauty prediction, this paper proposes MD-Net, a dual-stream collaborative network. MD-Net integrates the generative prior of a frozen pre-trained diffusion model (specifically, its U-Net encoder) with the linear-complexity long-range modeling of Vision Mamba, coordinating the two streams' multi-scale features through a cross-attention mechanism. This design overcomes the inherent limitations of CNNs (strong local inductive bias but weak global context awareness) and of ViTs (comprehensive global modeling but high computational cost). Evaluated on the SCUT-FBP5500 benchmark, MD-Net achieves a new state-of-the-art Pearson correlation coefficient of 0.9235, demonstrating both the effectiveness and the efficiency of jointly leveraging generative priors and state-space models for visual aesthetic assessment.
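The dual-stream design described above can be sketched in PyTorch. All module names, dimensions, and layer choices here are illustrative assumptions, not the paper's actual configuration: one stream's tokens act as queries that attend to the other stream's tokens, and the fused representation is pooled and regressed to a scalar beauty score.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse two token sequences: queries from one stream attend to the other.
    Hypothetical sketch of cross-attention fusion; dims are assumptions."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_tokens, kv_tokens):
        fused, _ = self.attn(q_tokens, kv_tokens, kv_tokens)
        return self.norm(q_tokens + fused)  # residual connection + norm

class DualStreamHead(nn.Module):
    """Skeleton of a dual-stream predictor: local (diffusion-prior) tokens
    and global (Mamba) tokens are fused, mean-pooled, and regressed."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fusion = CrossAttentionFusion(dim)
        self.head = nn.Linear(dim, 1)  # scalar beauty score

    def forward(self, local_tokens, global_tokens):
        fused = self.fusion(local_tokens, global_tokens)
        return self.head(fused.mean(dim=1))  # pool over the token axis

# toy usage with random stand-in features for both streams
local_feats = torch.randn(2, 196, 256)   # e.g. diffusion U-Net encoder tokens
global_feats = torch.randn(2, 196, 256)  # e.g. Vision Mamba tokens
score = DualStreamHead()(local_feats, global_feats)
print(score.shape)  # torch.Size([2, 1])
```

The residual-plus-norm wrapper around the attention call is a common stabilizing choice, not something the summary specifies.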

📝 Abstract
The automated prediction of facial beauty is a benchmark task in affective computing that requires a sophisticated understanding of both local aesthetic details (e.g., skin texture) and global facial harmony (e.g., symmetry, proportions). Existing models, based on either Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), exhibit inherent architectural biases that limit their performance; CNNs excel at local feature extraction but struggle with long-range dependencies, while ViTs model global relationships at a significant computational cost. This paper introduces the **Mamba-Diffusion Network (MD-Net)**, a novel dual-stream architecture that resolves this trade-off by delegating specialized roles to state-of-the-art models. The first stream leverages a frozen U-Net encoder from a pre-trained latent diffusion model, providing a powerful generative prior for fine-grained aesthetic qualities. The second stream employs a Vision Mamba (Vim), a modern state-space model, to efficiently capture global facial structure with linear-time complexity. By synergistically integrating these complementary representations through a cross-attention mechanism, MD-Net creates a holistic and nuanced feature space for prediction. Evaluated on the SCUT-FBP5500 benchmark, MD-Net sets a new state-of-the-art, achieving a Pearson Correlation of **0.9235** and demonstrating the significant potential of hybrid architectures that fuse generative and sequential modeling paradigms for complex visual assessment tasks.
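The abstract's claim that state-space models capture long-range structure in linear time can be illustrated with a toy scalar recurrence. This is not the Vim/Mamba selective-scan algorithm, just the underlying linear state-space recurrence x_t = a·x_{t-1} + b·u_t, y_t = c·x_t, which runs in a single O(T) pass over the sequence:

```python
import torch

def ssm_scan(u, a, b, c):
    """Toy linear state-space recurrence run in O(T) over the sequence.
    Illustrates linear-time sequence modeling; not the Vim algorithm."""
    B, T, D = u.shape
    x = torch.zeros(B, D)  # hidden state carried across timesteps
    ys = []
    for t in range(T):
        x = a * x + b * u[:, t]  # state update
        ys.append(c * x)         # readout
    return torch.stack(ys, dim=1)

# each output token depends on all earlier inputs via the recurrent state
y = ssm_scan(torch.randn(2, 8, 4), a=0.9, b=0.5, c=1.0)
print(y.shape)  # torch.Size([2, 8, 4])
```

Contrast this with self-attention, whose pairwise token interactions cost O(T²): the recurrent state lets every output summarize the full history at constant per-step cost.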
Problem

Research questions and friction points this paper is trying to address.

Resolving trade-off between local feature extraction and global facial modeling
Overcoming limitations of CNNs and ViTs for facial beauty prediction
Integrating generative priors with state-space models for holistic assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses generative priors with state-space models
Uses dual-stream architecture with cross-attention
Combines diffusion U-Net encoder with Vision Mamba
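The "generative prior" ingredient listed above amounts to reusing a pre-trained encoder with its weights frozen. A minimal sketch of that pattern, assuming PyTorch: the `nn.Sequential` stand-in below is hypothetical and replaces the actual latent-diffusion U-Net encoder, which the summary does not specify in code form.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    """Freeze a pretrained encoder so it serves as a fixed feature prior:
    disable gradients and switch to eval mode (stops dropout/BN updates)."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()

# stand-in for a pretrained diffusion U-Net encoder (hypothetical layers)
encoder = freeze(nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.SiLU(),
))

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():  # forward pass only through the frozen stream
    feats = encoder(x)

trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
print(feats.shape, trainable)  # torch.Size([1, 64, 112, 112]) 0
```

Only the fusion and prediction layers would then receive gradients during training, which is what makes reusing a large generative backbone cheap.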