🤖 AI Summary
This study addresses a gap in existing video understanding benchmarks, which focus on general-purpose tasks and do not cover the dense multimodal signals and commercial intent reasoning found in e-commerce short videos. The authors introduce E-VAds, the first benchmark specifically designed for e-commerce short video understanding, comprising 3,961 high-quality videos from Taobao and 19,785 open-ended question-answer pairs organized into five tasks across two dimensions: Perception, and Cognition and Reasoning. They further propose E-VAds-R1, a reasoning model trained with MG-GRPO, a reinforcement learning strategy built on a multi-grained reward design, which achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples. The work also contributes a multimodal information density assessment framework, which shows that e-commerce content is denser than mainstream datasets across visual, audio, and textual modalities, and a multi-agent system for generating the question-answer pairs.
📝 Abstract
E-commerce short videos represent a high-revenue segment of the online video industry, characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect reasoning about commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across the visual, audio, and textual modalities than mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce the E-commerce Video Ads Benchmark (E-VAds), the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, (1) Perception and (2) Cognition and Reasoning, which together comprise five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model trained with MG-GRPO, a strategy built on a multi-grained reward design. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.
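The abstract does not give the MG-GRPO reward formula, so the following is only an illustrative sketch of the general idea it describes: a reward that grows smoothly at low answer quality and non-linearly near the top, fed into the group-relative advantage normalization commonly used in GRPO-style training. The function names, the `threshold` and `sharpness` parameters, and the exponential bonus are all assumptions made for illustration, not the authors' design.

```python
import math
from statistics import mean, pstdev


def multi_grained_reward(score: float, threshold: float = 0.8,
                         sharpness: float = 4.0) -> float:
    """Hypothetical reward shape (not the paper's formula).

    `score` is an answer-quality score in [0, 1]. Below `threshold` the
    reward equals the score, which gives a smooth signal during early
    exploration. Above it, an exponential bonus is added so that
    near-perfect answers are rewarded disproportionately.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    bonus = math.expm1(sharpness * max(0.0, score - threshold))
    return score + bonus


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group of responses:
    subtract the group mean and divide by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical responses to the same question, scored by a judge.
    scores = [0.2, 0.5, 0.8, 0.95]
    rewards = [multi_grained_reward(s) for s in scores]
    advantages = group_relative_advantages(rewards)
    for s, r, a in zip(scores, rewards, advantages):
        print(f"score={s:.2f} reward={r:.3f} advantage={a:+.3f}")
```

Under these assumed settings, responses scoring at or below 0.8 receive a reward equal to their score, while the 0.95 response receives an extra bonus, so its advantage stands out more within the group than a purely linear reward would allow.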