🤖 AI Summary
This study addresses the deceptive practices prevalent in multimodal advertisements on short-video platforms, where manipulation often arises from coordinated misuse of visual, audio, and textual modalities. To combat this, the authors propose a policy-driven, rule-guided multitask auditing framework that integrates chain-of-thought reasoning, multimodal alignment, and reinforcement learning to detect both intra-modal manipulation and cross-modal inconsistencies. A novel rule-based In-Context Chain-of-Thought (ICoT) data synthesis pipeline is introduced to drastically reduce annotation costs. The framework further employs a composite reward mechanism that jointly optimizes causal coherence and regulatory compliance. Evaluated on real-world advertising data, the model significantly outperforms strong baselines in accuracy, consistency, and generalization, while maintaining high interpretability and robustness.
📝 Abstract
Short-video platforms now host vast numbers of multimodal ads whose deceptive visuals, speech, and subtitles demand finer-grained, policy-driven moderation than general community-safety filters provide. We present BLM-Guard, a content-audit framework for commercial ads that fuses Chain-of-Thought reasoning with rule-based policy principles and a critic-guided reward. A rule-driven In-Context Chain-of-Thought (ICoT) data-synthesis pipeline jump-starts training by generating structured scene descriptions, reasoning chains, and labels, cutting annotation costs. Reinforcement learning then refines the model using a composite reward that balances causal coherence with policy adherence. A multitask architecture models both intra-modal manipulations (e.g., exaggerated imagery) and cross-modal mismatches (e.g., subtitle-speech drift), boosting robustness. Experiments on real short-video ads show that BLM-Guard surpasses strong baselines in accuracy, consistency, and generalization.
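The abstract does not specify the exact form of the composite reward, only that it balances causal coherence with policy adherence. As a rough illustration, one common way to combine two such signals is a weighted blend; the function name, score inputs, and the `alpha` weight below are assumptions for exposition, not details from the paper:

```python
def composite_reward(coherence_score: float,
                     compliance_score: float,
                     alpha: float = 0.5) -> float:
    """Hypothetical sketch: blend a causal-coherence critic score with a
    policy-compliance score. Both scores are assumed to lie in [0, 1];
    alpha is an assumed mixing hyperparameter, not taken from the paper."""
    assert 0.0 <= alpha <= 1.0
    return alpha * coherence_score + (1.0 - alpha) * compliance_score
```

In an RL fine-tuning loop, a scalar reward of this shape would be emitted per generated audit trace, so tuning `alpha` trades off how strongly the policy is pushed toward coherent reasoning chains versus strict regulatory compliance.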