VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the coarse-grained user-item modeling, insufficient multimodal understanding, and inability of static prompts to capture dynamic user preferences in video recommendation, this paper proposes VRAgent-R1—the first dual-agent recommendation framework grounded in multimodal large language models (MLLMs). The Item Perception (IP) Agent achieves fine-grained video content understanding via progressive visual-semantic parsing, while the User Simulation (US) Agent integrates chain-of-thought (CoT) reasoning with reinforcement learning to model dynamic user decision-making. Their collaborative interaction establishes an interactive, joint user-item modeling mechanism, overcoming limitations of conventional frozen-LLM-plus-prompt-engineering paradigms. On MicroLens-100k, the IP Agent improves NDCG@10 by 6.0%, and the US Agent achieves a 45.0% higher user decision simulation accuracy than the best baseline.
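The dual-agent flow described above (IP Agent parses item content, US Agent refines the candidate set against user preferences) can be sketched conceptually. All names below are illustrative stand-ins, not the paper's implementation: the real IP Agent is an MLLM doing progressive visual-semantic parsing, and the real US Agent combines CoT reasoning with RL; here both are mocked with simple keyword heuristics to show only the interaction pattern.

```python
from dataclasses import dataclass, field

@dataclass
class Video:
    video_id: str
    title: str
    tags: list = field(default_factory=list)

def ip_agent_perceive(video):
    """Stand-in for the IP Agent: derive a set of 'semantics' from the
    title and tags (the paper uses an MLLM for this step)."""
    return set(video.title.lower().split()) | {t.lower() for t in video.tags}

def us_agent_refine(candidates, user_history, top_k=2):
    """Stand-in for the US Agent: rerank candidates by overlap with the
    user's historical interests (the paper uses CoT reasoning + RL)."""
    interests = set()
    for v in user_history:
        interests |= ip_agent_perceive(v)
    scored = [(len(ip_agent_perceive(v) & interests), v) for v in candidates]
    scored.sort(key=lambda x: (-x[0], x[1].video_id))  # best overlap first
    return [v for _, v in scored[:top_k]]

history = [Video("h1", "cute cat compilation", ["cat"])]
candidates = [
    Video("c1", "dog training tips"),
    Video("c2", "funny cat moments", ["cat"]),
    Video("c3", "cooking pasta"),
]
refined = us_agent_refine(candidates, history, top_k=1)
```

The point of the sketch is the division of labor: item understanding and user simulation are separate agents whose outputs feed each other, rather than one frozen LLM driven by a static prompt.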

📝 Abstract
Owing to powerful natural language processing and generative capabilities, large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, in the realm of video recommendation, existing studies predominantly resort to prompt-based simulation using frozen LLMs and encounter the intricate challenge of multimodal content understanding. This frequently results in suboptimal item modeling and user preference learning, thereby ultimately constraining recommendation performance. To address these challenges, we introduce VRAgent-R1, a novel agent-based paradigm that incorporates human-like intelligence in user simulation. Specifically, VRAgent-R1 comprises two distinct agents: the Item Perception (IP) Agent and the User Simulation (US) Agent, designed for interactive user-item modeling. Firstly, the IP Agent emulates human-like progressive thinking based on MLLMs, effectively capturing hidden recommendation semantics in videos. With a more comprehensive multimodal content understanding provided by the IP Agent, the video recommendation system is equipped to provide higher-quality candidate items. Subsequently, the US Agent refines the recommended video sets based on in-depth chain-of-thought (CoT) reasoning and achieves better alignment with real user preferences through reinforcement learning. Experimental results on a large-scale video recommendation benchmark have demonstrated the effectiveness of our proposed VRAgent-R1 method, e.g., the IP Agent achieves a 6.0% improvement in NDCG@10 on the MicroLens-100k dataset, while the US Agent shows approximately 45.0% higher accuracy in user decision simulation compared to state-of-the-art baselines.
Problem

Research questions and friction points this paper is trying to address.

Enhancing video recommendation via MLLM-based agents
Improving multimodal content understanding in recommendations
Aligning recommendations with real user preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-based agents for video recommendation
Interactive user-item modeling with IP and US agents
Reinforcement learning for user preference alignment
Siran Chen
University of Chinese Academy of Sciences
semiconductor, AI model
Boyu Chen
The University of Sydney
Neural Architecture Search, Transformer
Chenyun Yu
PhD, Department of Computer Science, City University of Hong Kong
Data science and management, query optimization, data mining, information security
Yuxiao Luo
SIAT@MMLab, Platform and Content Group, Tencent
Ouyang Yi
Platform and Content Group, Tencent
Lei Cheng
Platform and Content Group, Tencent
Chengxiang Zhuo
Platform and Content Group, Tencent
Zang Li
Platform and Content Group, Tencent
Yali Wang
SIAT@MMLab, Platform and Content Group, Tencent; Shanghai AILab