Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Moderation

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Short-video platforms face significant content-safety challenges, including high annotation costs and poor cross-category generalization. To address these issues, we propose a unified multimodal large language model (MLLM) pretraining framework designed for inappropriate content detection. The method integrates three synergistic pretraining stages (caption generation, visual question answering, and chain-of-thought reasoning) to jointly enhance visual perception, semantic understanding, and logical reasoning. Combining domain-adaptive pretraining, vision-language alignment, instruction tuning, and chain-of-thought reasoning, the framework enables end-to-end violation classification. Experimental results demonstrate substantial improvements over strong baselines in both zero-shot and supervised settings, with particularly robust generalization to unseen violation categories. This work provides an efficient, scalable, and unified solution for short-video content safety governance.

📝 Abstract
Short video platforms are evolving rapidly, making the identification of inappropriate content increasingly critical. Existing approaches typically train separate and small classification models for each type of issue, which requires extensive human-labeled data and lacks cross-issue generalization. We propose a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm for unified inappropriate content detection. To address the distribution gap between short video content and the original pretraining data of MLLMs, as well as the complex issue definitions, we introduce three targeted pretraining tasks: (1) *Caption*, to enhance the MLLM's perception of video details; (2) *Visual Question Answering (VQA)*, to deepen the MLLM's understanding of issue definitions and annotation guidelines; (3) *Chain-of-Thought (CoT)*, to enhance the MLLM's reasoning capability. Experimental results show that our pretraining approach significantly improves the MLLM's performance in both zero-shot and supervised fine-tuning (SFT) settings. In addition, our pretrained model demonstrates strong generalization capabilities to emergent, previously unseen issues.
Problem

Research questions and friction points this paper is trying to address.

Detecting inappropriate content on rapidly evolving short video platforms
Overcoming distribution gaps between short videos and pretraining data
Addressing complex issue definitions requiring enhanced reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reasoning-enhanced MLLM pretraining for content moderation
Domain-adaptive pretraining with three specialized tasks
Caption, VQA, and Chain-of-Thought tasks jointly strengthen perception, understanding, and reasoning
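The paper does not publish its data format, but the three pretraining tasks above can be pictured as chat-style instruction-tuning samples over the same video. The sketch below is purely illustrative: the field names, prompts, and `make_sample` helper are assumptions, not taken from the paper.

```python
# Hypothetical serialization of the three pretraining tasks
# (Caption, VQA, Chain-of-Thought) as instruction-tuning records.
# All field names and prompt wording are illustrative assumptions.

def make_sample(task, video_id, prompt, target):
    """Wrap one video-grounded example in a chat-style record."""
    return {
        "task": task,
        "video": video_id,
        "messages": [
            {"role": "user", "content": f"<video> {prompt}"},
            {"role": "assistant", "content": target},
        ],
    }

samples = [
    # Caption: push the model to perceive fine-grained visual details.
    make_sample("caption", "vid_001",
                "Describe the visual content of this short video in detail.",
                "A person demonstrates a kitchen gadget on a countertop ..."),
    # VQA: ground the model in issue definitions / annotation guidelines.
    make_sample("vqa", "vid_001",
                "Does this video depict dangerous behavior as defined in the "
                "annotation guidelines? Answer yes or no, then explain.",
                "No. The activity shown is routine cooking ..."),
    # CoT: elicit step-by-step reasoning before the final verdict.
    make_sample("cot", "vid_001",
                "Reason step by step about whether this video violates any "
                "policy, then give a verdict.",
                "Step 1: Identify on-screen actions ... Verdict: no violation."),
]

print([s["task"] for s in samples])  # → ['caption', 'vqa', 'cot']
```

Keeping all three tasks in one shared chat schema is what lets a single MLLM be pretrained on them jointly and then fine-tuned end-to-end for violation classification.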
Zixuan Wang
TikTok
Yu Sun
TikTok
Hongwei Wang
TikTok
Baoyu Jing
University of Illinois at Urbana-Champaign
Xiang Shen
TikTok
Xin Dong
TikTok
Zhuolin Hao
ByteDance
Hongyu Xiong
Stanford University
Yang Song
TikTok

Image Processing · Applied Machine Learning · Computer Vision · Smart Manufacturing