Cross-Modal Prototype Augmentation and Dual-Grained Prompt Learning for Social Media Popularity Prediction

📅 2025-08-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address insufficient vision–text alignment and superficial multimodal fusion in social media popularity prediction, this paper proposes a Multi-level Prototype-enhanced Framework. The method innovatively integrates cross-modal attention—enhancing semantic alignment and structural modeling between images and text—with dual-granularity prompt learning: coarse-grained (category-level) and fine-grained (instance-level) prompts jointly optimize modality consistency and hierarchical association discovery. Additionally, a contrastive learning–driven hierarchical prototype network is introduced to improve class discriminability and cross-modal representation robustness. Evaluated on multiple mainstream benchmarks, the framework consistently outperforms existing state-of-the-art methods, achieving average accuracy gains of 3.2%–5.7%. It establishes a novel, interpretable, and scalable paradigm for multimodal social media analysis.

Technology Category

Application Category

📝 Abstract
Social Media Popularity Prediction is a complex multimodal task that requires effective integration of images, text, and structured information. However, current approaches suffer from inadequate visual-textual alignment and fail to capture the inherent cross-content correlations and hierarchical patterns in social media data. To overcome these limitations, we establish a multi-class framework , introducing hierarchical prototypes for structural enhancement and contrastive learning for improved vision-text alignment. Furthermore, we propose a feature-enhanced framework integrating dual-grained prompt learning and cross-modal attention mechanisms, achieving precise multimodal representation through fine-grained category modeling. Experimental results demonstrate state-of-the-art performance on benchmark metrics, establishing new reference standards for multimodal social media analysis.
Problem

Research questions and friction points this paper is trying to address.

Inadequate visual-textual alignment in multimodal data
Failure to capture cross-content correlations and hierarchies
Need for precise multimodal representation through category modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical prototypes with contrastive learning for alignment
Dual-grained prompt learning for category modeling
Cross-modal attention mechanisms for precise representation
🔎 Similar Papers
No similar papers found.
A
Ao Zhou
State Key Laboratory for Novel Software Technology, Nanjing University
M
Mingsheng Tu
Chongqing University of Posts and Telecommunications
L
Luping Wang
Chongqing University of Posts and Telecommunications
T
Tenghao Sun
Chongqing University of Posts and Telecommunications
Z
Zifeng Cheng
State Key Laboratory for Novel Software Technology, Nanjing University
Y
Yafeng Yin
State Key Laboratory for Novel Software Technology, Nanjing University
Zhiwei Jiang
Zhiwei Jiang
Nanjing University
Natural Language Processing
Qing Gu
Qing Gu
Nanjing University