🤖 AI Summary
To address the challenge of malicious users evading content moderation on children's video platforms by embedding violence or nudity within only a few video frames, this paper proposes a cascaded cross-modal Transformer architecture, marking the first instance of deep audio-visual fusion for this fine-grained detection task. Methodologically: (1) intra-modal Transformer encoders model local temporal patterns in the audio and visual streams; (2) a cascaded cross-Transformer enables dynamic audio-visual alignment; and (3) a dual-layer interaction mechanism, comprising both intra-modal and inter-modal modules, facilitates multi-granularity feature fusion. Evaluated on fine-grained harmful content detection in children's videos, the approach significantly outperforms unimodal baselines and prior multimodal fusion methods, establishing new state-of-the-art performance. This work introduces a cross-modal paradigm for fine-grained content safety analysis on multimedia platforms.
📝 Abstract
As video-sharing platforms have grown over the past decade, child viewership has surged, increasing the need for precise detection of harmful content such as violence or explicit scenes. Malicious users exploit moderation systems by embedding unsafe content in only a few frames to evade detection. While prior research has advanced such fine-grained detection using visual cues, audio features remain underexplored. In this study, we integrate audio cues with visual cues for fine-grained detection of content harmful to children and introduce SNIFR, a novel framework for effective cross-modal alignment. SNIFR employs a transformer encoder for intra-modality interaction, followed by a cascaded cross-transformer for inter-modality alignment. Our approach achieves superior performance over unimodal and baseline fusion methods, setting a new state-of-the-art.
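The pipeline described above (intra-modality self-attention per stream, then cascaded cross-attention between streams, then per-frame classification) can be sketched in PyTorch. This is an illustrative reconstruction under our own assumptions, not the authors' code: the class name, layer counts, embedding dimension, and the particular cascade ordering (visual attends to audio, then audio attends to the fused visual) are hypothetical choices for demonstration.

```python
# Illustrative sketch of a cascaded cross-modal Transformer for
# audio-visual fusion. Assumes per-frame audio and visual embeddings
# of a shared dimension d_model; all names and hyperparameters here
# are our own assumptions, not taken from the paper.
import torch
import torch.nn as nn


class CascadedCrossTransformer(nn.Module):
    def __init__(self, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        # Intra-modality encoders: self-attention within each stream.
        self.audio_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=num_layers)
        self.visual_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=num_layers)
        # Cross-attention blocks applied in cascade.
        self.a_to_v = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.v_to_a = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # Per-frame binary head: safe vs. harmful.
        self.classifier = nn.Linear(2 * d_model, 2)

    def forward(self, audio, visual):
        # audio, visual: (batch, frames, d_model)
        a = self.audio_enc(audio)
        v = self.visual_enc(visual)
        # Stage 1: visual queries attend to audio keys/values.
        v_fused, _ = self.a_to_v(v, a, a)
        # Stage 2 (the cascade): audio queries attend to the fused visual.
        a_fused, _ = self.v_to_a(a, v_fused, v_fused)
        # Frame-level logits support fine-grained detection of harmful
        # content embedded in only a handful of frames.
        joint = torch.cat([v_fused, a_fused], dim=-1)
        return self.classifier(joint)  # (batch, frames, 2)
```

Emitting a logit per frame, rather than a single video-level score, is what makes the fusion useful for the evasion pattern the paper targets: a few unsafe frames hidden in an otherwise benign video still produce localized high-scoring frames.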