Multi-Aspect Cross-modal Quantization for Generative Recommendation

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In generative recommendation, multimodal semantic IDs often suffer from a lack of hierarchical structure, severe inter-modal conflicts, and low-quality discrete encoding, all of which limit generative model performance. To address this, we propose the first multi-aspect quantization framework integrating both implicit and explicit cross-modal alignment. Our method jointly optimizes cross-modal discrete encoding, sequential modeling, and next-token prediction to construct semantic IDs hierarchically while minimizing modality conflicts. Its key innovation lies in unifying explicit alignment (via an alignment loss) and implicit alignment (via collaborative reconstruction) within both ID learning and generative training, thereby substantially enhancing modality complementarity and semantic consistency. Extensive experiments on three benchmark datasets demonstrate significant improvements in recommendation accuracy over state-of-the-art baselines.

📝 Abstract
Generative Recommendation (GR) has emerged as a new paradigm in recommender systems. This approach relies on quantized representations to discretize item features, modeling users' historical interactions as sequences of discrete tokens. Based on these tokenized sequences, GR predicts the next item by employing next-token prediction methods. The challenges of GR lie in constructing high-quality semantic identifiers (IDs) that are hierarchically organized, minimally conflicting, and conducive to effective generative model training. However, current approaches remain limited in their ability to harness multimodal information and to capture the deep and intricate interactions among diverse modalities, both of which are essential for learning high-quality semantic IDs and for effectively training GR models. To address this, we propose Multi-Aspect Cross-modal quantization for generative Recommendation (MACRec), which introduces multimodal information and incorporates it into both semantic ID learning and generative model training from different aspects. Specifically, we first introduce cross-modal quantization during the ID learning process, which effectively reduces conflict rates and thus improves codebook usability through the complementary integration of multimodal information. In addition, to further enhance the generative ability of our GR model, we incorporate multi-aspect cross-modal alignments, including both implicit and explicit alignments. Finally, we conduct extensive experiments on three well-known recommendation datasets to demonstrate the effectiveness of our proposed method.
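To make the abstract's notion of hierarchical semantic IDs concrete, the following is a minimal sketch of residual quantization, a common way to discretize an item embedding into a hierarchically organized token tuple. This is an illustration only, not MACRec's actual procedure: the two-level hierarchy, codebook sizes, and random embeddings are all assumptions for demonstration.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Assign a hierarchical semantic ID to an item embedding.

    At each level, pick the nearest codeword and quantize the
    remaining residual at the next level, so earlier tokens carry
    coarser semantics and later tokens refine them.
    """
    ids = []
    residual = np.asarray(x, dtype=float)
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)  # distance to each code
        idx = int(np.argmin(dists))                          # nearest codeword
        ids.append(idx)
        residual = residual - codebook[idx]                  # quantize what remains
    return tuple(ids)

# Illustrative setup: two codebook levels of 8 codes each, 4-dim embeddings.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)), rng.normal(size=(8, 4))]
item_embedding = rng.normal(size=4)
semantic_id = residual_quantize(item_embedding, codebooks)
```

In a GR pipeline, each item's tuple of codes becomes its token sequence, and a user's history is the concatenation of these tokens fed to a next-token-prediction model.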
Problem

Research questions and friction points this paper is trying to address.

Constructing hierarchically organized semantic IDs with minimal conflicts
Integrating multimodal information for generative recommendation systems
Enhancing cross-modal interactions to improve generative model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal quantization reduces conflict rates in ID learning
Multi-aspect alignments enhance generative model training
Integrates multimodal information through complementary methods
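The explicit alignment mentioned above is realized through an alignment loss between modalities. As a hedged sketch of what such a loss can look like, here is a symmetric InfoNCE-style objective that pulls each item's text and image embeddings together and pushes apart mismatched pairs; the temperature value and batch construction are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def explicit_alignment_loss(text_emb, image_emb, temperature=0.1):
    """Symmetric contrastive alignment loss over a batch of items.

    Matching text/image pairs sit on the diagonal of the similarity
    matrix; cross-entropy in both directions rewards high diagonal
    similarity relative to all mismatched pairs.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature        # pairwise cosine similarities, scaled
    labels = np.arange(len(t))            # item i's match is image i

    def ce(l):
        l = l - l.max(axis=1, keepdims=True)               # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()                # diagonal = true pairs

    return 0.5 * (ce(logits) + ce(logits.T))               # text→image + image→text

# Illustrative batch: image embeddings are noisy copies of the text ones.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
image = text + 0.05 * rng.normal(size=(4, 8))
loss = explicit_alignment_loss(text, image)
```

The implicit alignment via collaborative reconstruction would instead be enforced by decoding one modality's quantized codes back toward the other modality's features; the contrastive term above covers only the explicit side.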
Authors

Fuwei Zhang
Institute of Artificial Intelligence, Beihang University
Xiaoyu Liu
Institute of Artificial Intelligence, Beihang University
Dongbo Xi
Meituan
Jishen Yin
Meituan
Huan Chen
Shunfeng Technology Company Limited
Peng Yan
Research Assistant of ZHAW, PhD student of UZH
Fuzhen Zhuang
Institute of Artificial Intelligence, Beihang University
Zhao Zhang
SKLCCSE, School of Computer Science and Engineering, Beihang University