🤖 AI Summary
Generative recommendation faces two key bottlenecks: shallow integration of collaborative signals and rigidly decoupled handling of multimodal features, both of which limit comprehensive item representation. To address these, we propose an end-to-end multimodal generative recommendation framework that, for the first time, dynamically guides multimodal fusion with collaborative signals. A Residual-Quantized Variational Autoencoder (RQ-VAE) jointly encodes user–item collaborative semantics and visual/textual features into unified discrete latent codes, bridging collaborative filtering and large language model (LLM)-based generation. The framework jointly models multimodal fusion, RQ-VAE encoding, LLM fine-tuning, and autoregressive discrete-code generation. Extensive experiments on multiple public benchmarks demonstrate substantial improvements over state-of-the-art generative and discriminative methods, with gains of up to 23.6% in Recall@10.
📝 Abstract
Generative recommendation models often struggle with two key challenges: (1) superficial integration of collaborative signals, and (2) decoupled fusion of multimodal features. These limitations hinder the creation of a truly holistic item representation. To overcome them, we propose CEMG, a novel Collaborative-Enhanced Multimodal Generative recommendation framework. Our approach features a Multimodal Fusion Layer that dynamically integrates visual and textual features under the guidance of collaborative signals. Subsequently, a Unified Modality Tokenization stage employs a Residual-Quantized VAE (RQ-VAE) to convert the fused representation into discrete semantic codes. Finally, in the End-to-End Generative Recommendation stage, a large language model is fine-tuned to autoregressively generate these item codes. Extensive experiments demonstrate that CEMG significantly outperforms state-of-the-art baselines.
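The tokenization step both sections describe can be illustrated with a minimal sketch of residual quantization: each level picks the nearest codebook entry to the current residual, and deeper levels refine the approximation, yielding a short sequence of discrete codes per item. This is an illustrative toy (random codebooks, NumPy nearest-neighbour lookup), not the paper's trained RQ-VAE; the function name and dimensions are assumptions.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Map a fused item embedding to discrete semantic codes.

    At each quantization level, select the codebook entry nearest to
    the current residual, then subtract it so the next level encodes
    what remains (the core mechanism of residual quantization).
    """
    codes, residual = [], z.copy()
    for cb in codebooks:                        # one codebook per level
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))             # nearest-neighbour code
        codes.append(idx)
        residual = residual - cb[idx]           # pass residual downward
    return codes

# Toy setup: 3 levels, 16 entries each, 8-dim embeddings (all assumed).
rng = np.random.default_rng(0)
dim, levels, size = 8, 3, 16
codebooks = [rng.normal(size=(size, dim)) for _ in range(levels)]
item_codes = residual_quantize(rng.normal(size=dim), codebooks)
print(item_codes)  # a 3-token discrete ID for one item
```

In the full framework these codes would serve as the item's "semantic ID" vocabulary, which the fine-tuned LLM generates autoregressively at recommendation time.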