🤖 AI Summary
To address representation misalignment and gradient interference when fine-tuning large vision-language models (LVLMs) for multimodal recommendation, this paper proposes SDA, a lightweight framework for Structural and Disentangled Adaptation. Methodologically, (1) cross-modal structural alignment uses intra-modal structures as a soft teacher to guide fine-grained alignment between visual and textual embeddings, and (2) a gated low-rank expert pathway decouples modality-specific gradient flows, mitigating the interference caused by shared adapters. Technically, SDA combines low-rank adaptation, gating mechanisms, and structural distillation to balance efficient fine-tuning with representation consistency. Extensive experiments on three Amazon datasets demonstrate average improvements of 6.15% in Hit@10 and 8.64% in NDCG@10; notably, gains on long-tail items reach 18.70%. Moreover, inference overhead remains negligible.
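The "gated low-rank expert pathway" can be pictured as a frozen backbone layer plus per-modality LoRA-style deltas mixed by a learned gate. The sketch below is a minimal NumPy illustration of that general idea, not the paper's implementation; all shapes, names, and the one-expert-per-modality routing are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 16, 4, 2  # hidden dim, low-rank width, one expert per modality

# Frozen backbone projection (stands in for a pretrained LVLM layer).
W = rng.standard_normal((d, d)) / np.sqrt(d)

# One low-rank (A, B) pair per modality expert, LoRA-style.
A = rng.standard_normal((n_experts, r, d)) * 0.01
B = np.zeros((n_experts, d, r))  # zero-init: adaptation starts as a no-op

# Gating network: routes each input to the experts via softmax weights.
Wg = rng.standard_normal((d, n_experts)) * 0.01

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gated_lora_forward(x):
    """x: (batch, d). Frozen path plus gate-weighted low-rank expert deltas."""
    base = x @ W.T                                   # frozen backbone output
    gates = softmax(x @ Wg)                          # (batch, n_experts)
    deltas = np.stack([(x @ A[e].T) @ B[e].T         # (batch, d) per expert
                       for e in range(n_experts)], axis=1)
    return base + (gates[..., None] * deltas).sum(axis=1)

x = rng.standard_normal((3, d))
y = gated_lora_forward(x)  # equals the frozen path at init, since B is zero
```

Because each modality's gradient flows through its own (A, B) pair, updates for one modality do not overwrite the other's adapter weights, which is the disentanglement the summary describes.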
📝 Abstract
Multimodal recommendation enhances accuracy by leveraging visual and textual signals, and its success largely depends on learning high-quality cross-modal representations. Recent advances in Large Vision-Language Models (LVLMs) offer unified multimodal representation learning, making them a promising backbone. However, applying LVLMs to recommendation remains challenging due to (i) representation misalignment, where domain gaps between item data and general pre-training lead to unaligned embedding spaces, and (ii) gradient conflicts during fine-tuning, where shared adapters cause interference and a lack of discriminative power. To address these issues, we propose SDA, a lightweight framework for Structural and Disentangled Adaptation, which integrates two components: Cross-Modal Structural Alignment (CMSA) and Modality-Disentangled Adaptation (MoDA). CMSA aligns embeddings using intra-modal structures as a soft teacher, while MoDA mitigates gradient conflicts by routing each modality through gated low-rank expert paths that disentangle gradient flows. Experiments on three public Amazon datasets show that SDA integrates seamlessly with existing multimodal and sequential recommenders, yielding average gains of 6.15% in Hit@10 and 8.64% in NDCG@10, and up to 12.83% and 18.70% gains on long-tail items with minimal inference overhead. Our code and full experimental results are available at https://github.com/RaoZhongtao/SDA.
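One common way to realize "intra-modal structure as a soft teacher" is to compare the row-wise item-item similarity distributions of the two modalities with a KL divergence. The sketch below shows that generic structural-distillation pattern, assuming cosine similarities and treating the textual structure as the teacher; it is an assumption-laden illustration, not CMSA's exact loss.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def structural_alignment_loss(vis, txt, tau=0.1):
    """KL(teacher || student) between row-wise item-item similarity
    distributions of the two modalities (self-similarity masked out)."""
    def sim_dist(z):
        z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
        s = z @ z.T / tau
        np.fill_diagonal(s, -np.inf)                      # drop self-pairs
        return softmax(s)
    p = sim_dist(txt)  # textual structure acts as the soft teacher here
    q = sim_dist(vis)  # visual embeddings are pulled toward it
    eps = 1e-12
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum(axis=1).mean())

rng = np.random.default_rng(0)
vis = rng.standard_normal((8, 16))
loss_mismatch = structural_alignment_loss(vis, rng.standard_normal((8, 16)))
loss_match = structural_alignment_loss(vis, vis)  # identical structure -> 0
```

Distilling the similarity structure, rather than forcing embeddings to coincide pointwise, lets each modality keep its own geometry while agreeing on which items are neighbors of which.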