AI Summary
Existing models for advertising video understanding struggle to bridge pixel-level perception and high-level marketing logic, and so fail to decode the relationship between visual narratives and implicit persuasion strategies. To address this gap, this work proposes AD-MIR, a two-stage framework of structured memory construction followed by structured reasoning. First, the input video is converted into a structured database that is queried by combining semantic retrieval with exact keyword matching. Then, an iterative query mechanism coupled with frame-level evidence verification emulates expert-like reasoning that is verifiable and self-correcting. AD-MIR is presented as the first end-to-end interpretable reasoning pipeline that connects fine-grained brand elements to abstract persuasive intent. On the AdsQA benchmark, it attains state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict accuracy and 9.5% in relaxed accuracy.
Abstract
Multimodal understanding of advertising videos is essential for interpreting the intricate relationship between visual storytelling and abstract persuasion strategies. However, despite excelling at general search, existing agents often struggle to bridge the cognitive gap between pixel-level perception and high-level marketing logic. To address this challenge, we introduce AD-MIR, a framework designed to decode advertising intent via a two-stage architecture. First, in the Structure-Aware Memory Construction phase, the system converts raw video into a structured database by integrating semantic retrieval with exact keyword matching. This approach prioritizes fine-grained brand details (e.g., logos, on-screen text) while dynamically filtering out irrelevant background noise to isolate key protagonists. Second, the Structured Reasoning Agent mimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. Crucially, it employs an evidence-based self-correction mechanism that rigorously validates these insights against specific video frames, automatically backtracking when visual support is lacking. Evaluation on the AdsQA benchmark demonstrates that AD-MIR achieves state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy. These results underscore that effective advertising understanding demands explicitly grounding abstract marketing strategies in pixel-level evidence. The code is available at https://github.com/Little-Fridge/AD-MIR.
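The two stages described above can be illustrated with a minimal sketch. It is not the AD-MIR implementation: the names `Segment`, `hybrid_score`, `retrieve`, and `answer_with_verification`, the weight `alpha`, and the `threshold` value are all hypothetical, and a bag-of-words cosine similarity stands in for whatever embedding model the authors use for semantic retrieval. The sketch only shows the general idea of blending a semantic score with exact keyword hits on on-screen text, and of accepting a claim only when a video segment supports it.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    """One entry of the (hypothetical) structured video memory."""
    start_s: float
    end_s: float
    caption: str   # scene description produced by some captioner
    ocr_text: str  # on-screen text and logo strings


def _tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_score(query: str, seg: Segment, alpha: float = 0.5) -> float:
    """Blend a semantic score with exact keyword matching on OCR text."""
    q = Counter(_tokens(query))
    # Assumption: bag-of-words cosine is a stand-in for embedding similarity.
    semantic = _cosine(q, Counter(_tokens(seg.caption)))
    ocr = set(_tokens(seg.ocr_text))
    keyword = sum(1 for t in q if t in ocr) / len(q) if q else 0.0
    return alpha * semantic + (1 - alpha) * keyword


def retrieve(query: str, memory: list[Segment], k: int = 2) -> list[Segment]:
    """Return the k segments with the highest hybrid score."""
    ranked = sorted(memory, key=lambda s: hybrid_score(query, s), reverse=True)
    return ranked[:k]


def answer_with_verification(hypotheses, memory, threshold=0.3):
    """Try candidate claims in order and keep the first one with evidence.

    A claim whose best-matching segment scores below `threshold` is
    rejected and the loop moves on to the next candidate (backtracking).
    """
    for claim in hypotheses:
        best = retrieve(claim, memory, k=1)
        if best and hybrid_score(claim, best[0]) >= threshold:
            return claim, best[0]
    return None, None


MEMORY = [
    Segment(0, 5, "a runner ties her shoes at sunrise", ""),
    Segment(5, 10, "close-up of a sneaker with a logo", "ACME RUN FASTER"),
    Segment(10, 15, "crowd cheering at finish line", ""),
]
```

Calling `answer_with_verification(["a celebrity endorses the drink", "acme logo sneaker"], MEMORY)` rejects the first claim for lack of supporting evidence and returns the second together with the segment that backs it. In a real system the threshold check would presumably be replaced by a vision-language model inspecting the retrieved frames.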