MDVT: Enhancing Multimodal Recommendation with Model-Agnostic Multimodal-Driven Virtual Triplets

📅 2025-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address insufficient representation learning caused by data sparsity in multimodal recommendation, this paper proposes a model-agnostic multimodal-driven virtual triplet construction framework. It synthesizes high-quality virtual user-item interaction triplets leveraging multimodal features to provide strong supervisory signals. We introduce three novel warm-start thresholding strategies—static, dynamic, and hybrid—to balance accuracy and computational efficiency. Additionally, we design a gradient-free offset-enhanced pairwise loss function to improve optimization stability. The method requires no architectural modifications to backbone models and is fully compatible with mainstream multimodal recommendation frameworks. Extensive experiments on multiple real-world datasets demonstrate consistent performance gains; notably, under extremely sparse settings, AUC improves by up to 3.2%. Crucially, the approach preserves training stability while delivering significant improvements in recommendation accuracy and robustness.

Technology Category

Application Category

📝 Abstract
The data sparsity problem significantly hinders the performance of recommender systems, as traditional models rely on limited historical interactions to learn user preferences and item properties. While incorporating multimodal information can explicitly represent these preferences and properties, existing works often use it only as side information, failing to fully leverage its potential. In this paper, we propose MDVT, a model-agnostic approach that constructs multimodal-driven virtual triplets to provide valuable supervision signals, effectively mitigating the data sparsity problem in multimodal recommendation systems. To ensure high-quality virtual triplets, we introduce three tailored warm-up threshold strategies: static, dynamic, and hybrid. The static warm-up threshold strategy exhaustively searches for the optimal number of warm-up epochs but is time-consuming and computationally intensive. The dynamic warm-up threshold strategy adjusts the warm-up period based on loss trends, improving efficiency but potentially missing optimal performance. The hybrid strategy combines both, using the dynamic strategy to find the approximate optimal number of warm-up epochs and then refining it with the static strategy in a narrow hyper-parameter space. Once the warm-up threshold is satisfied, the virtual triplets are used for joint model optimization by our enhanced pair-wise loss function without causing significant gradient skew. Extensive experiments on multiple real-world datasets demonstrate that integrating MDVT into advanced multimodal recommendation models effectively alleviates the data sparsity problem and improves recommendation performance, particularly in sparse data scenarios.
Problem

Research questions and friction points this paper is trying to address.

Address data sparsity in multimodal recommendation systems
Enhance multimodal information usage beyond side information
Optimize virtual triplet quality with warm-up threshold strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model-agnostic multimodal-driven virtual triplets
Three tailored warm-up threshold strategies
Enhanced pair-wise loss function optimization
🔎 Similar Papers
No similar papers found.
J
Jinfeng Xu
Department of Electrical and Electronic Engineering, The University of Hong Kong, HongKong SAR, China
Zheyu Chen
Zheyu Chen
PhD, Beijing Institute of Technology
Recommendation System
J
Jinze Li
Department of Electrical and Electronic Engineering, The University of Hong Kong, HongKong SAR, China
S
Shuo Yang
Department of Electrical and Electronic Engineering, The University of Hong Kong, HongKong SAR, China
Hewei Wang
Hewei Wang
Carnegie Mellon University / Apple
Machine LearningComputer VisionVision-Language ModelsGenerative Models
Y
Yijie Li
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, United States
Mengran Li
Mengran Li
Sun Yat-sen University
network scienceheterogeneous graphhypergraph
Puzhen Wu
Puzhen Wu
Cornell University
Medical AIBioinformatics
Edith C. H. Ngai
Edith C. H. Ngai
Associate Professor, Dept. of Electrical and Electronic Engineering, The University of Hong Kong
edge intelligenceInternet-of-Thingssmart citiessmart healthsecurity and privacy