VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the coarse-grained user-item modeling, insufficient multimodal understanding, and inability of static prompts to capture dynamic user preferences in video recommendation, this paper proposes VRAgent-R1—the first dual-agent recommendation framework grounded in multimodal large language models (MLLMs). The Item Perception (IP) Agent achieves fine-grained video content understanding via progressive visual-semantic parsing, while the User Simulation (US) Agent integrates chain-of-thought (CoT) reasoning with reinforcement learning to model dynamic user decision-making. Their collaborative interaction establishes an interactive, joint user-item modeling mechanism, overcoming limitations of conventional frozen-LLM-plus-prompt-engineering paradigms. On MicroLens-100k, the IP Agent improves NDCG@10 by 6.0%, and the US Agent achieves a 45.0% higher user decision simulation accuracy than the best baseline.
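The dual-agent flow described above (IP Agent parses item content, US Agent refines the candidate set against user preferences) can be sketched conceptually. All names below are illustrative stand-ins, not the paper's implementation: the real IP Agent is an MLLM doing progressive visual-semantic parsing, and the real US Agent combines CoT reasoning with RL; here both are mocked with simple keyword heuristics to show only the interaction pattern.

```python
from dataclasses import dataclass, field

@dataclass
class Video:
    video_id: str
    title: str
    tags: list = field(default_factory=list)

def ip_agent_perceive(video):
    """Stand-in for the IP Agent: derive a set of 'semantics' from the
    title and tags (the paper uses an MLLM for this step)."""
    return set(video.title.lower().split()) | {t.lower() for t in video.tags}

def us_agent_refine(candidates, user_history, top_k=2):
    """Stand-in for the US Agent: rerank candidates by overlap with the
    user's historical interests (the paper uses CoT reasoning + RL)."""
    interests = set()
    for v in user_history:
        interests |= ip_agent_perceive(v)
    scored = [(len(ip_agent_perceive(v) & interests), v) for v in candidates]
    scored.sort(key=lambda x: (-x[0], x[1].video_id))  # best overlap first
    return [v for _, v in scored[:top_k]]

history = [Video("h1", "cute cat compilation", ["cat"])]
candidates = [
    Video("c1", "dog training tips"),
    Video("c2", "funny cat moments", ["cat"]),
    Video("c3", "cooking pasta"),
]
refined = us_agent_refine(candidates, history, top_k=1)
```

The point of the sketch is the division of labor: item understanding and user simulation are separate agents whose outputs feed each other, rather than one frozen LLM driven by a static prompt.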

📝 Abstract
Owing to powerful natural language processing and generative capabilities, large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, in the realm of video recommendation, existing studies predominantly resort to prompt-based simulation using frozen LLMs and encounter the intricate challenge of multimodal content understanding. This frequently results in suboptimal item modeling and user preference learning, thereby ultimately constraining recommendation performance. To address these challenges, we introduce VRAgent-R1, a novel agent-based paradigm that incorporates human-like intelligence in user simulation. Specifically, VRAgent-R1 comprises two distinct agents: the Item Perception (IP) Agent and the User Simulation (US) Agent, designed for interactive user-item modeling. Firstly, the IP Agent emulates human-like progressive thinking based on MLLMs, effectively capturing hidden recommendation semantics in videos. With a more comprehensive multimodal content understanding provided by the IP Agent, the video recommendation system is equipped to provide higher-quality candidate items. Subsequently, the US Agent refines the recommended video sets based on in-depth chain-of-thought (CoT) reasoning and achieves better alignment with real user preferences through reinforcement learning. Experimental results on a large-scale video recommendation benchmark have demonstrated the effectiveness of our proposed VRAgent-R1 method, e.g., the IP Agent achieves a 6.0% improvement in NDCG@10 on the MicroLens-100k dataset, while the US Agent shows approximately 45.0% higher accuracy in user decision simulation compared to state-of-the-art baselines.
Problem

Research questions and friction points this paper is trying to address.

Enhancing video recommendation via MLLM-based agents
Improving multimodal content understanding in recommendations
Aligning recommendations with real user preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-based agents for video recommendation
Interactive user-item modeling with IP and US agents
Reinforcement learning for user preference alignment
Siran Chen
University of Chinese Academy of Sciences
semiconductor, AI model
Boyu Chen
The University of Sydney
Neural Architecture Search, Transformer
Chenyun Yu
PhD, Department of Computer Science, City University of Hong Kong
Data science and management, query optimization, data mining, information security
Yuxiao Luo
SIAT@MMLab, Platform and Content Group, Tencent
Ouyang Yi
Platform and Content Group, Tencent
Lei Cheng
Platform and Content Group, Tencent
Chengxiang Zhuo
Platform and Content Group, Tencent
Zang Li
Platform and Content Group, Tencent
Yali Wang
SIAT@MMLab, Platform and Content Group, Tencent; Shanghai AILab