Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing two key challenges in multimodal sequential recommendation—sample difficulty imbalance and cross-modal semantic misalignment—this paper proposes HaNoRec, a novel framework. Methodologically, it (1) introduces a hardness-aware negative sampling mechanism that dynamically weights hard negatives to mitigate both overfitting and undertraining; (2) incorporates Gaussian perturbation regularization for preference learning, enabling the policy model to autonomously calibrate modality misalignment without relying on a fixed reference model—particularly enhancing long-sequence modeling; and (3) fuses textual and visual modalities, leveraging language-based reasoning to model behavioral sequences and visual signals to enrich interest representations. Extensive experiments on multiple benchmark datasets demonstrate that HaNoRec consistently outperforms state-of-the-art methods, achieving superior recommendation accuracy, improved cross-modal semantic consistency, and enhanced robustness.
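The Gaussian perturbation regularization described above can be illustrated with a minimal sketch: noise is added to the model's output logits before the log-softmax, so the preference signal is computed over a perturbed distribution rather than a fixed reference distribution. The function name and the noise scale `sigma` are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

def gaussian_perturbed_logprob(logits, target_idx, sigma=0.1, rng=None):
    """Sketch of Gaussian-perturbed distribution optimization (hypothetical):
    perturb the output logits with zero-mean Gaussian noise, then return the
    log-probability of the target token under the noisy softmax."""
    rng = rng or random.Random(0)
    noisy = [z + rng.gauss(0.0, sigma) for z in logits]
    # Numerically stable log-softmax: subtract the max before exponentiating.
    m = max(noisy)
    log_z = m + math.log(sum(math.exp(z - m) for z in noisy))
    return noisy[target_idx] - log_z
```

With `sigma=0` this reduces to an ordinary log-softmax; increasing `sigma` smooths the preference signal, which is one way to read the paper's claim that the policy model can calibrate modality misalignment without a fixed reference model.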

📝 Abstract
Recent advances in Large Language Models (LLMs) have opened new avenues for sequential recommendation by enabling natural language reasoning over user behavior sequences. A common approach formulates recommendation as a language modeling task, where interaction histories are transformed into prompts and user preferences are learned via supervised fine-tuning. However, these methods operate solely in the textual modality and often miss users' fine-grained interests, especially when shaped by rich visual signals such as product images or movie posters. Multimodal Large Language Models (MLLMs) offer a promising alternative by aligning text and vision in a shared semantic space. A prevalent training paradigm applies Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) to model user preferences. Yet, two core challenges remain: 1) Imbalanced sample hardness, where random negative sampling causes overfitting on easy examples and under-training on hard ones; 2) Cross-modal semantic bias, where the fixed reference model in DPO prevents the policy model from correcting modality misalignments, especially over long sequences. To address these issues, we propose a Multimodal LLM framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation (HaNoRec). Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness, prioritizing harder examples. It further introduces Gaussian-perturbed distribution optimization on output logits to enhance cross-modal semantic consistency and reduce modality bias inherited from the reference model.
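The hardness-aware weighting described in the abstract can be sketched as a modified DPO objective: each sample's loss is scaled by a weight that grows as its implicit reward margin shrinks, so hard negatives (small margin) dominate the gradient. This is a minimal illustration under assumed forms; the weight function, `beta`, and `gamma` are hypothetical stand-ins for the paper's actual formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hardness_weighted_dpo_loss(chosen_logps, rejected_logps,
                               ref_chosen_logps, ref_rejected_logps,
                               beta=0.1, gamma=1.0):
    """Sketch of a hardness-aware DPO loss (hypothetical form).

    For each sample, the implicit reward margin is
        beta * [(logp_chosen - ref_logp_chosen)
                - (logp_rejected - ref_logp_rejected)],
    as in standard DPO. A small margin means the policy barely prefers the
    chosen item (a hard sample), so it receives a larger weight.
    """
    losses = []
    for pc, pr, rc, rr in zip(chosen_logps, rejected_logps,
                              ref_chosen_logps, ref_rejected_logps):
        margin = beta * ((pc - rc) - (pr - rr))
        hardness = 1.0 - sigmoid(margin)   # small margin -> high hardness
        weight = hardness ** gamma         # gamma sharpens the emphasis
        losses.append(-weight * math.log(sigmoid(margin)))
    return sum(losses) / len(losses)
```

With `weight = 1` this collapses to the plain DPO loss; the hardness term is what re-balances easy versus hard negatives, matching the abstract's claim of prioritizing harder examples.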
Problem

Research questions and friction points this paper is trying to address.

Addresses imbalanced sample hardness in multimodal sequential recommendation optimization
Mitigates cross-modal semantic bias in preference learning for recommendation systems
Enhances multimodal alignment through adaptive hardness-aware and noise-regularized optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive preference optimization with hardness-aware weighting
Gaussian-perturbed distribution optimization for modality alignment
Multimodal LLM framework integrating visual and textual signals