M^2VAE: Multi-Modal Multi-View Variational Autoencoder for Cold-start Item Recommendation

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cold-start item recommendation suffers from insufficient interaction data for newly introduced items. While existing approaches incorporate multimodal content, they neglect the multi-view structural nature of modalities and fail to distinguish between shared and modality-specific features. This paper proposes the Multi-Modal Multi-View Variational Autoencoder (M^2VAE), which jointly models attribute, category, and image modalities. It employs a Product-of-Experts mechanism to fuse latent variables across multiple views, designs a disentangled contrastive loss to explicitly separate shared and modality-specific representations, and introduces a preference-guided Mixture-of-Experts module for user-adaptive representation aggregation. Additionally, co-occurrence-based contrastive learning, which requires no pretraining, is leveraged to enhance generalization. Extensive experiments on multiple real-world datasets demonstrate that M^2VAE significantly outperforms state-of-the-art cold-start recommendation methods, achieving substantial improvements in both recommendation accuracy and cross-item generalization.
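
For reference, the Product-of-Experts fusion named above has a standard closed form when each view's approximate posterior is a diagonal Gaussian: the product of Gaussian experts is again Gaussian, with precision-weighted mean. This is the textbook product-of-Gaussians identity used in multimodal VAEs, not a formula quoted from the paper (whether M^2VAE also includes a standard-normal prior expert is not stated here; if it does, the prior simply enters the sums as one more expert with mean 0 and unit variance):

```latex
q(\mathbf{z} \mid \mathbf{x}_{1:M}) \propto \prod_{m=1}^{M} q_m(\mathbf{z} \mid \mathbf{x}_m),
\qquad q_m = \mathcal{N}\!\big(\boldsymbol{\mu}_m, \operatorname{diag}(\boldsymbol{\sigma}_m^2)\big)

\boldsymbol{\sigma}^2 = \Big( \sum_{m} \boldsymbol{\sigma}_m^{-2} \Big)^{-1},
\qquad
\boldsymbol{\mu} = \boldsymbol{\sigma}^2 \sum_{m} \boldsymbol{\mu}_m \, \boldsymbol{\sigma}_m^{-2}
```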

📝 Abstract
Cold-start item recommendation is a significant challenge in recommendation systems, particularly when new items are introduced without any historical interaction data. While existing methods leverage multi-modal content to alleviate the cold-start issue, they often neglect the inherent multi-view structure of modalities and the distinction between shared and modality-specific features. In this paper, we propose the Multi-Modal Multi-View Variational AutoEncoder (M^2VAE), a generative model that addresses the challenges of modeling common and unique views in attribute and multi-modal features, as well as user preferences over single-typed item features. Specifically, we generate type-specific latent variables for item IDs, categorical attributes, and image features, and use Product-of-Experts (PoE) to derive a common representation. A disentangled contrastive loss decouples the common view from unique views while preserving feature informativeness. To model user inclinations, we employ a preference-guided Mixture-of-Experts (MoE) to adaptively fuse representations. We further incorporate co-occurrence signals via contrastive learning, eliminating the need for pretraining. Extensive experiments on real-world datasets validate the effectiveness of our approach.
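
A minimal PyTorch sketch of the PoE fusion step described above, implementing the precision-weighted product-of-Gaussians identity. The inclusion of a standard-normal prior expert and the tensor shapes are illustrative assumptions, not details taken from the paper:

```python
import torch

def poe_fuse(mus, logvars):
    """Fuse per-view Gaussian posteriors with a Product-of-Experts.

    mus, logvars: (num_views, batch, dim) -- one Gaussian expert per view
    (e.g. item ID, categorical attributes, image features).
    A standard-normal prior expert is prepended, a common choice in
    multimodal VAEs (an assumption here, not confirmed by the paper).
    """
    prior_mu = torch.zeros_like(mus[:1])          # N(0, I) prior expert
    prior_logvar = torch.zeros_like(logvars[:1])
    mus = torch.cat([prior_mu, mus], dim=0)
    logvars = torch.cat([prior_logvar, logvars], dim=0)

    precision = torch.exp(-logvars)               # 1 / sigma^2 per expert
    var = 1.0 / precision.sum(dim=0)              # fused variance
    mu = var * (mus * precision).sum(dim=0)       # precision-weighted mean
    return mu, var

# Toy usage: 3 views, batch of 8, 64-d latent space.
mu, var = poe_fuse(torch.randn(3, 8, 64), torch.zeros(3, 8, 64))
print(mu.shape, var.shape)  # torch.Size([8, 64]) torch.Size([8, 64])
```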
Problem

Research questions and friction points this paper is trying to address.

Addresses cold-start item recommendation without historical data
Models common and unique views in multi-modal features
Captures user preferences over single-typed item features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Modal Multi-View VAE for cold-start recommendation
Product-of-Experts for common representation learning
Preference-guided Mixture-of-Experts for adaptive fusion
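
To make the last bullet concrete, here is a minimal sketch of one plausible preference-guided MoE fusion: a gating network conditioned on the user embedding softmax-weights the per-view item representations. The linear gate and the three-view layout are assumptions for illustration; the paper's exact gating design is not specified here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceGuidedMoE(nn.Module):
    """Sketch: fuse view representations with user-conditioned gate weights."""

    def __init__(self, dim: int, num_views: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_views)  # user embedding -> view logits

    def forward(self, user_emb, view_reprs):
        # user_emb:   (batch, dim)
        # view_reprs: (batch, num_views, dim), e.g. common/attribute/image views
        weights = F.softmax(self.gate(user_emb), dim=-1)         # (batch, num_views)
        return (weights.unsqueeze(-1) * view_reprs).sum(dim=1)   # (batch, dim)

# Toy usage: 3 views, 64-d embeddings.
moe = PreferenceGuidedMoE(dim=64, num_views=3)
fused = moe(torch.randn(8, 64), torch.randn(8, 3, 64))
print(fused.shape)  # torch.Size([8, 64])
```

The softmax gate lets each user emphasize the view type (common, attribute, or image) most predictive of their preferences, which matches the abstract's description of adaptive fusion.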
👥 Authors
Chuan He
Ant Group
Yongchao Liu
Ant Group
Qiang Li
Zhejiang University of Technology
Wenliang Zhong
University of Science and Technology Beijing
Chuntao Hong
Ant Group
Xinwei Yao
Zhejiang University of Technology