🤖 AI Summary
Generative recommendation faces two key bottlenecks: shallow integration of collaborative signals and rigidly decoupled handling of multimodal features, both of which limit comprehensive item representation. To address these, we propose an end-to-end multimodal generative recommendation framework that, for the first time, dynamically guides multimodal fusion with collaborative signals. A Residual-Quantized Variational Autoencoder (RQ-VAE) jointly encodes user–item collaborative semantics and visual/textual features into unified discrete latent codes, bridging collaborative filtering and large language model (LLM)-based generation. The framework jointly models multimodal fusion, RQ-VAE encoding, LLM fine-tuning, and autoregressive discrete-code generation. Extensive experiments on multiple public benchmarks demonstrate substantial improvements over state-of-the-art generative and discriminative methods, with gains of up to 23.6% in Recall@10.
📝 Abstract
Generative recommendation models often struggle with two key challenges: (1) superficial integration of collaborative signals, and (2) decoupled fusion of multimodal features. These limitations hinder the creation of a truly holistic item representation. To overcome them, we propose CEMG, a novel Collaborative-Enhanced Multimodal Generative recommendation framework. Our approach features a Multimodal Fusion Layer that dynamically integrates visual and textual features under the guidance of collaborative signals. Subsequently, a Unified Modality Tokenization stage employs a Residual-Quantized VAE (RQ-VAE) to convert the fused representation into discrete semantic codes. Finally, in the End-to-End Generative Recommendation stage, a large language model is fine-tuned to autoregressively generate these item codes. Extensive experiments demonstrate that CEMG significantly outperforms state-of-the-art baselines.
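The tokenization step both sections describe can be illustrated with a minimal sketch of residual quantization: each level picks the nearest codebook entry to the current residual, and deeper levels refine the approximation, yielding a short sequence of discrete codes per item. This is an illustrative toy (random codebooks, NumPy nearest-neighbour lookup), not the paper's trained RQ-VAE; the function name and dimensions are assumptions.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Map a fused item embedding to discrete semantic codes.

    At each quantization level, select the codebook entry nearest to
    the current residual, then subtract it so the next level encodes
    what remains (the core mechanism of residual quantization).
    """
    codes, residual = [], z.copy()
    for cb in codebooks:                        # one codebook per level
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))             # nearest-neighbour code
        codes.append(idx)
        residual = residual - cb[idx]           # pass residual downward
    return codes

# Toy setup: 3 levels, 16 entries each, 8-dim embeddings (all assumed).
rng = np.random.default_rng(0)
dim, levels, size = 8, 3, 16
codebooks = [rng.normal(size=(size, dim)) for _ in range(levels)]
item_codes = residual_quantize(rng.normal(size=dim), codebooks)
print(item_codes)  # a 3-token discrete ID for one item
```

In the full framework these codes would serve as the item's "semantic ID" vocabulary, which the fine-tuned LLM generates autoregressively at recommendation time.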