Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
In micro-video recommendation, the common practice of freezing large vision-language models (LVLMs) and using them as black-box feature extractors has not been systematically evaluated. This paper presents the first empirical study of this practice, revealing three key insights: (1) intermediate decoder hidden states substantially outperform generated caption embeddings; (2) item ID embeddings are indispensable and cannot be substituted; and (3) hidden states from different LVLM layers contribute heterogeneously to recommendation performance. Building on these findings, we propose a lightweight Dual Feature Fusion (DFF) framework that adaptively and hierarchically fuses ID embeddings with multi-layer LVLM hidden states, optimized end-to-end for ranking objectives. Evaluated on two real-world micro-video recommendation benchmarks, DFF achieves state-of-the-art performance, significantly surpassing strong baselines. Extensive experiments confirm DFF's plug-and-play compatibility and cross-model generalizability.

📝 Abstract
Frozen Large Vision-Language Models (LVLMs) are increasingly employed in micro-video recommendation due to their strong multimodal understanding. However, their integration lacks systematic empirical evaluation: practitioners typically deploy LVLMs as fixed black-box feature extractors without systematically comparing alternative representation strategies. To address this gap, we present the first systematic empirical study along two key design dimensions: (i) integration strategies with ID embeddings, specifically replacement versus fusion, and (ii) feature extraction paradigms, comparing LVLM-generated captions with intermediate decoder hidden states. Extensive experiments on representative LVLMs reveal three key principles: (1) intermediate hidden states consistently outperform caption-based representations, as natural-language summarization inevitably discards fine-grained visual semantics crucial for recommendation; (2) ID embeddings capture irreplaceable collaborative signals, rendering fusion strictly superior to replacement; and (3) the effectiveness of intermediate decoder features varies significantly across layers. Guided by these insights, we propose the Dual Feature Fusion (DFF) Framework, a lightweight and plug-and-play approach that adaptively fuses multi-layer representations from frozen LVLMs with item ID embeddings. DFF achieves state-of-the-art performance on two real-world micro-video recommendation benchmarks, consistently outperforming strong baselines and providing a principled approach to integrating off-the-shelf large vision-language models into micro-video recommender systems.
Problem

Research questions and friction points this paper is trying to address.

Systematically evaluates frozen LVLM integration strategies for micro-video recommendation
Compares feature extraction methods: captions versus intermediate hidden states
Proposes a fusion framework to combine LVLM features with ID embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses intermediate decoder hidden states with ID embeddings
Adaptively combines multi-layer LVLM representations through fusion
Lightweight plug-and-play framework for frozen LVLM integration
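The fusion idea in the bullets above can be sketched in a few lines: learn softmax weights over the frozen LVLM's per-layer hidden states, then gate-combine the pooled feature with the item ID embedding. This is a minimal illustrative sketch, not the paper's actual DFF implementation; the function name, tensor shapes, and the specific sigmoid-gate form are assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of layer logits.
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_feature_fusion(layer_states, id_emb, layer_logits, W_gate, b_gate):
    """Illustrative fusion of multi-layer frozen-LVLM features with an ID embedding.

    layer_states: (L, d) one hidden state per decoder layer (frozen LVLM)
    id_emb:       (d,)   collaborative item ID embedding
    layer_logits: (L,)   learnable per-layer importance logits
    W_gate, b_gate: parameters of a learned fusion gate (shapes (d, 2d), (d,))
    """
    # Adaptive multi-layer pooling: softmax-weighted sum across layers,
    # reflecting the finding that layers contribute heterogeneously.
    alpha = softmax(layer_logits)          # (L,)
    lvlm_feat = alpha @ layer_states       # (d,)

    # Gated fusion (not replacement) with the ID embedding, since ID
    # embeddings carry irreplaceable collaborative signals.
    z = np.concatenate([lvlm_feat, id_emb])          # (2d,)
    gate = 1.0 / (1.0 + np.exp(-(W_gate @ z + b_gate)))  # (d,) in (0, 1)
    return gate * lvlm_feat + (1.0 - gate) * id_emb

# Toy usage with random tensors standing in for real features.
rng = np.random.default_rng(0)
L, d = 4, 8
fused = dual_feature_fusion(
    rng.normal(size=(L, d)), rng.normal(size=d),
    rng.normal(size=L), rng.normal(size=(d, 2 * d)), np.zeros(d),
)
```

In an end-to-end recommender, `layer_logits`, `W_gate`, and `b_gate` would be trained against the ranking objective while the LVLM itself stays frozen, which is what makes the approach lightweight and plug-and-play.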
Huatuan Sun
Nanjing University of Science and Technology, China

Yunshan Ma
Singapore Management University, NUS
Multimodal Event Forecasting · Bundle Recommendation · Computational Fashion/Finance/Security/Politics

Changguang Wu
Nanjing University of Science and Technology, China

Yanxin Zhang
University of Wisconsin-Madison, USA

Pengfei Wang
GienTech Technology Co., Ltd., China

Xiaoyu Du
Nanjing University of Science and Technology, China