🤖 AI Summary
Addressing two key challenges in multimodal sequential recommendation—sample difficulty imbalance and cross-modal semantic misalignment—this paper proposes HaNoRec, a novel framework. Methodologically, it (1) introduces a hardness-aware negative sampling mechanism that dynamically weights hard negatives, mitigating overfitting on easy examples and under-training on hard ones; (2) incorporates Gaussian perturbation regularization for preference learning, enabling the policy model to autonomously calibrate modality misalignment without relying on a fixed reference model, which particularly benefits long-sequence modeling; and (3) fuses textual and visual modalities, leveraging language-based reasoning to model behavioral sequences and visual signals to enrich interest representations. Extensive experiments on multiple benchmark datasets demonstrate that HaNoRec consistently outperforms state-of-the-art methods, achieving superior recommendation accuracy, improved cross-modal semantic consistency, and enhanced robustness.
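The hardness-aware weighting described in point (1) can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes hardness is estimated from the DPO reward margin (a small margin means the policy barely separates the preferred item from the negative, i.e. a hard sample), and the names `hardness_weight` and the `resp` responsiveness factor are hypothetical stand-ins for the paper's dynamic weighting terms.

```python
import math

def dpo_margin(pi_c, pi_r, ref_c, ref_r, beta=0.1):
    """Standard DPO reward margin:
    beta * ((log pi(chosen) - log ref(chosen)) - (log pi(rejected) - log ref(rejected)))."""
    return beta * ((pi_c - ref_c) - (pi_r - ref_r))

def hardness_weight(margin, responsiveness=1.0, alpha=1.0):
    # Hypothetical scheme: sigmoid(-margin) so that smaller margins
    # (harder samples) receive larger optimization weights, scaled by
    # the policy's real-time responsiveness to this sample.
    hardness = 1.0 / (1.0 + math.exp(margin))
    return hardness * responsiveness ** alpha

def weighted_dpo_loss(samples, beta=0.1):
    """Hardness-weighted average of the per-sample DPO loss -log sigmoid(margin)."""
    total = 0.0
    for s in samples:
        m = dpo_margin(s["pi_c"], s["pi_r"], s["ref_c"], s["ref_r"], beta)
        w = hardness_weight(m, s.get("resp", 1.0))
        total += w * -math.log(1.0 / (1.0 + math.exp(-m)))
    return total / len(samples)
```

Under this toy scheme, a negative the policy already separates well (large margin) is down-weighted, while a near-tie negative dominates the gradient, which is the imbalance-correcting behavior the summary describes.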
📝 Abstract
Recent advances in Large Language Models (LLMs) have opened new avenues for sequential recommendation by enabling natural language reasoning over user behavior sequences. A common approach formulates recommendation as a language modeling task, where interaction histories are transformed into prompts and user preferences are learned via supervised fine-tuning. However, these methods operate solely in the textual modality and often miss users' fine-grained interests, especially when shaped by rich visual signals such as product images or movie posters. Multimodal Large Language Models (MLLMs) offer a promising alternative by aligning text and vision in a shared semantic space. A prevalent training paradigm applies Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) to model user preferences. Yet, two core challenges remain: 1) Imbalanced sample hardness, where random negative sampling causes overfitting on easy examples and under-training on hard ones; 2) Cross-modal semantic bias, where the fixed reference model in DPO prevents the policy model from correcting modality misalignments, especially over long sequences. To address these issues, we propose a Multimodal LLM framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation (HaNoRec). Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness, prioritizing harder examples. It further introduces Gaussian-perturbed distribution optimization on output logits to enhance cross-modal semantic consistency and reduce modality bias inherited from the reference model.
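One plausible reading of "Gaussian-perturbed distribution optimization on output logits" can be sketched as a consistency regularizer: add Gaussian noise to the logits and penalize the KL divergence between the clean and perturbed output distributions. This is an assumption about the mechanism, not the paper's published formulation; the function names and the choice of KL as the discrepancy measure are illustrative.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_div(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gaussian_perturbed_consistency(logits, sigma=0.1, seed=0):
    # Hypothetical regularizer: perturb each logit with N(0, sigma^2)
    # noise and measure how much the output distribution shifts.
    # Minimizing this term encourages predictions that are stable under
    # perturbation, rather than anchored to a fixed reference model.
    rng = random.Random(seed)
    noisy = [x + rng.gauss(0.0, sigma) for x in logits]
    return kl_div(softmax(logits), softmax(noisy))
```

In this reading, the noise term plays the stabilizing role that the frozen reference model plays in vanilla DPO, while leaving the policy free to drift away from the reference's modality biases.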