🤖 AI Summary
To address the challenge of malicious users evading content moderation on children's video platforms by embedding violence or nudity within only a few video frames, this paper proposes a cascaded cross-modal Transformer architecture, marking the first instance of deep audio-visual fusion for this fine-grained detection task. Methodologically: (1) intra-modal Transformer encoders model local temporal patterns in the audio and visual streams; (2) a cascaded cross-Transformer enables dynamic audio-visual alignment; and (3) a dual-layer interaction mechanism, comprising both intra-modal and inter-modal modules, facilitates multi-granularity feature fusion. Evaluated on fine-grained harmful content detection in children's videos, the approach significantly outperforms unimodal baselines and prior multimodal fusion methods, establishing new state-of-the-art performance. This work introduces a cross-modal paradigm for fine-grained content safety analysis on multimedia platforms.
📝 Abstract
As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content such as violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in only a few frames to evade detection. While prior research has advanced such fine-grained detection using visual cues, audio features remain underexplored. In this study, we integrate audio cues with visual cues for fine-grained detection of content harmful to children and introduce SNIFR, a novel framework for effective cross-modal alignment. SNIFR employs a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Our approach achieves superior performance over unimodal and baseline fusion methods, setting a new state-of-the-art.
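The pipeline described above (intra-modality self-attention per stream, then cascaded cross-attention between streams, then per-frame classification) can be sketched in PyTorch. This is an illustrative reconstruction under our own assumptions, not the authors' code: the class name, layer counts, embedding dimension, and the particular cascade ordering (visual attends to audio, then audio attends to the fused visual) are hypothetical choices for demonstration.

```python
# Illustrative sketch of a cascaded cross-modal Transformer for
# audio-visual fusion. Assumes per-frame audio and visual embeddings
# of a shared dimension d_model; all names and hyperparameters here
# are our own assumptions, not taken from the paper.
import torch
import torch.nn as nn


class CascadedCrossTransformer(nn.Module):
    def __init__(self, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        # Intra-modality encoders: self-attention within each stream.
        self.audio_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=num_layers)
        self.visual_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=num_layers)
        # Cross-attention blocks applied in cascade.
        self.a_to_v = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.v_to_a = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # Per-frame binary head: safe vs. harmful.
        self.classifier = nn.Linear(2 * d_model, 2)

    def forward(self, audio, visual):
        # audio, visual: (batch, frames, d_model)
        a = self.audio_enc(audio)
        v = self.visual_enc(visual)
        # Stage 1: visual queries attend to audio keys/values.
        v_fused, _ = self.a_to_v(v, a, a)
        # Stage 2 (the cascade): audio queries attend to the fused visual.
        a_fused, _ = self.v_to_a(a, v_fused, v_fused)
        # Frame-level logits support fine-grained detection of harmful
        # content embedded in only a handful of frames.
        joint = torch.cat([v_fused, a_fused], dim=-1)
        return self.classifier(joint)  # (batch, frames, 2)
```

Emitting a logit per frame, rather than a single video-level score, is what makes the fusion useful for the evasion pattern the paper targets: a few unsafe frames hidden in an otherwise benign video still produce localized high-scoring frames.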