Multi-Aspect Cross-modal Quantization for Generative Recommendation

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In generative recommendation, multimodal semantic IDs often suffer from a lack of hierarchical structure, severe inter-modal conflicts, and low-quality discrete encoding, all of which limit generative model performance. To address this, we propose the first multi-aspect quantization framework integrating both implicit and explicit cross-modal alignment. Our method jointly optimizes cross-modal discrete encoding, sequential modeling, and next-token prediction to construct semantic IDs hierarchically while minimizing modality conflicts. Its key innovation lies in unifying explicit alignment (via an alignment loss) and implicit alignment (via collaborative reconstruction) within both ID learning and generative training, thereby substantially enhancing modality complementarity and semantic consistency. Extensive experiments on three benchmark datasets demonstrate significant improvements in recommendation accuracy over state-of-the-art baselines.

📝 Abstract
Generative Recommendation (GR) has emerged as a new paradigm in recommender systems. This approach relies on quantized representations to discretize item features, modeling users' historical interactions as sequences of discrete tokens. Based on these tokenized sequences, GR predicts the next item by employing next-token prediction methods. The challenges of GR lie in constructing high-quality semantic identifiers (IDs) that are hierarchically organized, minimally conflicting, and conducive to effective generative model training. However, current approaches remain limited in their ability to harness multimodal information and to capture the deep and intricate interactions among diverse modalities, both of which are essential for learning high-quality semantic IDs and for effectively training GR models. To address this, we propose Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), which introduces multimodal information and incorporates it into both semantic ID learning and generative model training from different aspects. Specifically, we first introduce cross-modal quantization during the ID learning process, which effectively reduces conflict rates and thus improves codebook usability through the complementary integration of multimodal information. In addition, to further enhance the generative ability of our GR model, we incorporate multi-aspect cross-modal alignments, including both implicit and explicit alignments. Finally, we conduct extensive experiments on three well-known recommendation datasets to demonstrate the effectiveness of our proposed method.
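To make the abstract's notion of hierarchical semantic IDs concrete, the following is a minimal sketch of residual quantization, a common way to discretize an item embedding into a hierarchically organized token tuple. This is an illustration only, not MACRec's actual procedure: the two-level hierarchy, codebook sizes, and random embeddings are all assumptions for demonstration.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Assign a hierarchical semantic ID to an item embedding.

    At each level, pick the nearest codeword and quantize the
    remaining residual at the next level, so earlier tokens carry
    coarser semantics and later tokens refine them.
    """
    ids = []
    residual = np.asarray(x, dtype=float)
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)  # distance to each code
        idx = int(np.argmin(dists))                          # nearest codeword
        ids.append(idx)
        residual = residual - codebook[idx]                  # quantize what remains
    return tuple(ids)

# Illustrative setup: two codebook levels of 8 codes each, 4-dim embeddings.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)), rng.normal(size=(8, 4))]
item_embedding = rng.normal(size=4)
semantic_id = residual_quantize(item_embedding, codebooks)
```

In a GR pipeline, each item's tuple of codes becomes its token sequence, and a user's history is the concatenation of these tokens fed to a next-token-prediction model.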
Problem

Research questions and friction points this paper is trying to address.

Constructing hierarchically organized semantic IDs with minimal conflicts
Integrating multimodal information for generative recommendation systems
Enhancing cross-modal interactions to improve generative model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal quantization reduces conflict rates in ID learning
Multi-aspect alignments enhance generative model training
Integrates multimodal information through complementary methods
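The explicit alignment mentioned above is realized through an alignment loss between modalities. As a hedged sketch of what such a loss can look like, here is a symmetric InfoNCE-style objective that pulls each item's text and image embeddings together and pushes apart mismatched pairs; the temperature value and batch construction are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def explicit_alignment_loss(text_emb, image_emb, temperature=0.1):
    """Symmetric contrastive alignment loss over a batch of items.

    Matching text/image pairs sit on the diagonal of the similarity
    matrix; cross-entropy in both directions rewards high diagonal
    similarity relative to all mismatched pairs.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature        # pairwise cosine similarities, scaled
    labels = np.arange(len(t))            # item i's match is image i

    def ce(l):
        l = l - l.max(axis=1, keepdims=True)               # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()                # diagonal = true pairs

    return 0.5 * (ce(logits) + ce(logits.T))               # text→image + image→text

# Illustrative batch: image embeddings are noisy copies of the text ones.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
image = text + 0.05 * rng.normal(size=(4, 8))
loss = explicit_alignment_loss(text, image)
```

The implicit alignment via collaborative reconstruction would instead be enforced by decoding one modality's quantized codes back toward the other modality's features; the contrastive term above covers only the explicit side.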
Authors

Fuwei Zhang
Institute of Artificial Intelligence, Beihang University
Xiaoyu Liu
Institute of Artificial Intelligence, Beihang University
Dongbo Xi
Meituan
Jishen Yin
Meituan
Huan Chen
Shunfeng Technology Company Limited
Peng Yan
Research Assistant of ZHAW, PhD student of UZH
Fuzhen Zhuang
Institute of Artificial Intelligence, Beihang University
Zhao Zhang
SKLCCSE, School of Computer Science and Engineering, Beihang University