AD-MIR: Bridging the Gap from Perception to Persuasion in Advertising Video Understanding via Structured Reasoning

2026-02-07
Citations: 0
Influential: 0
AI Summary
Existing models for advertising video understanding struggle to bridge pixel-level perception with high-level marketing logic, failing to effectively decode the intricate relationship between visual narratives and implicit persuasion strategies. To address this gap, this work proposes AD-MIR, a framework that employs a two-stage process of structured memory construction and reasoning. First, the input video is transformed into a structured database by integrating semantic retrieval and keyword matching; then, an iterative query mechanism coupled with frame-level evidence verification simulates expert-like, verifiable, and self-correcting reasoning. AD-MIR achieves the first end-to-end interpretable inference pipeline that connects fine-grained brand elements to abstract persuasive intents. On the AdsQA benchmark, it attains state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict accuracy and 9.5% in relaxed accuracy.
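
The first stage, which combines semantic retrieval with keyword matching over a structured memory, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Segment` record, the `hybrid_search` function, the `alpha` weight, and the brand string "AEROSTEP" are all hypothetical, and a bag-of-words cosine similarity stands in for whatever learned embedding model the real system uses.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    """One row of a hypothetical structured video memory."""
    start_s: float
    caption: str         # free-text description of the clip
    on_screen_text: str  # OCR-style text such as logos or slogans


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two bag-of-words vectors (embedding stand-in)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(query, keywords, memory, alpha=0.5):
    """Rank segments by alpha * semantic score + (1 - alpha) * exact-match score."""
    q = Counter(tokens(query))
    wanted = {k.lower() for k in keywords}
    scored = []
    for seg in memory:
        semantic = cosine(q, Counter(tokens(seg.caption)))
        found = set(tokens(seg.on_screen_text))
        # Fraction of requested keywords that appear verbatim on screen.
        exact = len(wanted & found) / len(wanted) if wanted else 0.0
        scored.append((alpha * semantic + (1 - alpha) * exact, seg))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Toy memory with made-up content.
memory = [
    Segment(0.0, "a runner laces shoes at dawn", ""),
    Segment(4.0, "close-up of shoes on wet pavement", "AEROSTEP"),
    Segment(9.0, "city skyline at night", "run further"),
]
ranked = hybrid_search("shoes close-up", ["aerostep"], memory)
best_score, best_segment = ranked[0]
```

In this toy setup, the segment that shows the brand text ranks first because the exact-match term rewards fine-grained details that a purely semantic score could miss.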

๐Ÿ“ Abstract
Multimodal understanding of advertising videos is essential for interpreting the intricate relationship between visual storytelling and abstract persuasion strategies. However, despite excelling at general search, existing agents often struggle to bridge the cognitive gap between pixel-level perception and high-level marketing logic. To address this challenge, we introduce AD-MIR, a framework designed to decode advertising intent via a two-stage architecture. First, in the Structure-Aware Memory Construction phase, the system converts raw video into a structured database by integrating semantic retrieval with exact keyword matching. This approach prioritizes fine-grained brand details (e.g., logos, on-screen text) while dynamically filtering out irrelevant background noise to isolate key protagonists. Second, the Structured Reasoning Agent mimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. Crucially, it employs an evidence-based self-correction mechanism that rigorously validates these insights against specific video frames, automatically backtracking when visual support is lacking. Evaluation on the AdsQA benchmark demonstrates that AD-MIR achieves state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy. These results underscore that effective advertising understanding demands explicitly grounding abstract marketing strategies in pixel-level evidence. The code is available at https://github.com/Little-Fridge/AD-MIR.
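
The evidence-based self-correction described in the abstract can be sketched as a verify-then-backtrack loop. The following Python is only an illustration under strong assumptions: a fixed list of candidate hypotheses stands in for the agent's iterative inquiry, frames are represented as sets of hypothetical visual cue labels, and the names `verify` and `reason_with_backtracking` are invented for this example rather than taken from the AD-MIR code.

```python
from typing import Iterable, Optional


def verify(cues: set[str], cited_frames: Iterable[int],
           frames: dict[int, set[str]]) -> bool:
    """A claim counts as supported if every cue appears in some cited frame."""
    seen: set[str] = set()
    for frame_id in cited_frames:
        seen |= frames.get(frame_id, set())
    return cues <= seen


def reason_with_backtracking(hypotheses, frames, max_steps=5):
    """Try hypotheses in order; drop any that lacks visual support.

    Each hypothesis is (tactic, required_cues, cited_frame_ids).
    Returns the first supported tactic (or None) plus a verification trace.
    """
    trace: list[tuple[str, bool]] = []
    answer: Optional[str] = None
    for tactic, cues, cited in hypotheses[:max_steps]:
        supported = verify(cues, cited, frames)
        trace.append((tactic, supported))
        if supported:
            answer = tactic
            break
        # Unsupported: backtrack and fall through to the next hypothesis.
    return answer, trace


# Made-up frame annotations and candidate persuasion tactics.
frames = {
    12: {"celebrity", "logo"},
    30: {"discount", "countdown"},
}
hypotheses = [
    ("celebrity endorsement", {"celebrity", "testimonial"}, [12]),
    ("scarcity appeal", {"discount", "countdown"}, [30]),
]
answer, trace = reason_with_backtracking(hypotheses, frames)
```

Here the first hypothesis is rejected because no cited frame contains a testimonial cue, so the loop moves on to the second one, which is fully supported. The trace records each verification outcome, which is what makes this kind of reasoning auditable.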
Problem

Research questions and friction points this paper is trying to address.

advertising video understanding
perception-to-persuasion gap
multimodal reasoning
marketing logic
visual storytelling
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured reasoning
multimodal advertising understanding
evidence-based self-correction
structure-aware memory
persuasion strategy decoding
Binxiao Xu
Peking University, Beijing, China
Junyu Feng
Xi’an Jiaotong University, Xi’an, China
Xiaopeng Lin
Xi’an Jiaotong University, Xi’an, China
Haodong Li
UC San Diego. Prev: HKUST, ZJU, Tencent.
3DV, Generative Models, Agents
Zhiyuan Feng
Tsinghua University, Beijing, China
Bohan Zeng
PhD student, Peking University
Data-Centric AI, Computer Vision, Diffusion Model, 3D
Shaolin Lu
Peking University, Beijing, China
Ming Lu
Intel, Beijing, China
Qi She
ByteDance, Beijing, China
Wentao Zhang
Institute of Physics, Chinese Academy of Sciences
photoemission, superconductivity, cuprate, HTSC, time-resolved