CEMG: Collaborative-Enhanced Multimodal Generative Recommendation

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generative recommendation faces two key bottlenecks: shallow integration of collaborative signals and overly rigid coupling of multimodal features, limiting comprehensive item representation. To address these, we propose an end-to-end multimodal generative recommendation framework that, for the first time, dynamically guides multimodal fusion using collaborative signals. We introduce a Residual Quantized Variational Autoencoder (RQ-VAE) to jointly encode user–item collaborative semantics and visual/textual features into unified discrete latent codes, bridging collaborative filtering with large language model (LLM)-based generation. Our framework jointly models multimodal fusion, RQ-VAE encoding, LLM fine-tuning, and autoregressive discrete code generation. Extensive experiments on multiple public benchmarks demonstrate substantial improvements over state-of-the-art generative and discriminative methods, achieving up to a 23.6% gain in Recall@10.
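The residual quantization at the heart of the RQ-VAE can be sketched as below: each stage quantizes the residual left over by the previous stage, so an item embedding becomes a short sequence of discrete codes. This is a minimal nearest-neighbor illustration with fixed random codebooks; in the paper's RQ-VAE the codebooks and encoder are learned jointly, which this sketch does not attempt.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Encode vector x as one code per codebook, quantizing residuals stage by stage."""
    codes, residual = [], x.astype(float)
    for cb in codebooks:  # cb has shape (K, d): K candidate codewords
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]  # next stage encodes what is left over
    reconstruction = x - residual      # sum of the selected codewords
    return codes, reconstruction

rng = np.random.default_rng(0)
d = 8
books = [rng.normal(size=(16, d)) for _ in range(3)]  # 3 levels, 16 codes each
x = rng.normal(size=d)
codes, recon = residual_quantize(x, books)
```

Each item is thus reduced to three small integers, which is what makes autoregressive generation by an LLM tractable.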

📝 Abstract
Generative recommendation models often struggle with two key challenges: (1) the superficial integration of collaborative signals, and (2) the decoupled fusion of multimodal features. These limitations hinder the creation of a truly holistic item representation. To overcome this, we propose CEMG, a novel Collaborative-Enhanced Multimodal Generative Recommendation framework. Our approach features a Multimodal Fusion Layer that dynamically integrates visual and textual features under the guidance of collaborative signals. Subsequently, a Unified Modality Tokenization stage employs a Residual Quantization VAE (RQ-VAE) to convert this fused representation into discrete semantic codes. Finally, in the End-to-End Generative Recommendation stage, a large language model is fine-tuned to autoregressively generate these item codes. Extensive experiments demonstrate that CEMG significantly outperforms state-of-the-art baselines.
Problem

Research questions and friction points this paper is trying to address.

Shallow, superficial integration of collaborative signals in generative recommenders
Rigid, decoupled fusion of visual and textual (multimodal) features
The resulting failure to form a truly holistic item representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically fuses multimodal features using collaborative signals.
Employs Residual Quantization VAE for discrete semantic tokenization.
Fine-tunes large language model for autoregressive item code generation.
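As a toy illustration of the final stage, the sketch below greedily decodes a sequence of discrete item codes. The fine-tuned LLM is replaced by a hypothetical `next_code_logits` stub (not the paper's model), so this shows only the autoregressive decoding loop, one code level at a time:

```python
import numpy as np

def next_code_logits(prefix, vocab=16, seed=42):
    # Hypothetical stub standing in for a fine-tuned LLM's
    # next-token distribution over the discrete code vocabulary.
    rng = np.random.default_rng(seed + len(prefix) + sum(prefix))
    return rng.normal(size=vocab)

def generate_item_codes(n_levels=3, vocab=16):
    """Greedily decode one discrete code per RQ level, conditioning on the prefix."""
    codes = []
    for _ in range(n_levels):
        logits = next_code_logits(codes, vocab)
        codes.append(int(np.argmax(logits)))  # greedy decoding; beam search also common
    return codes

generated = generate_item_codes()
```

The generated code sequence would then be matched back to the item whose RQ-VAE codes it equals, turning sequence generation into item recommendation.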
Yuzhen Lin
Shenzhen University
Multimedia Forensics, Multimedia Security
Hongyi Chen
Samueli School of Engineering, University of California, Los Angeles, CA 90095, USA
Xuanjing Chen
Columbia Business School, Columbia University, New York, NY 10027, USA
Shaowen Wang
Professor, University of Illinois Urbana-Champaign
CyberGIS, Geospatial Data Science, Spatial AI, Spatial Analysis, Sustainability
Ivonne Xu
Department of Physics, University of Chicago, Chicago, IL 60637, USA
Dongming Jiang
Department of Computer Science, Rice University, Houston, TX 77005, USA