Embedding-based Retrieval in Multimodal Content Moderation

📅 2025-06-30
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Traditional classification-based content moderation on short-video platforms suffers from delayed response, high operational cost, and poor adaptability to emerging violation patterns. To address these challenges, this paper proposes an embedding-based multimodal moderation framework. We innovatively integrate supervised contrastive learning (SCL) into both unimodal and multimodal embedding modeling and design an end-to-end system comprising embedding generation and approximate nearest neighbor (ANN) retrieval—marking the first practical deployment of embedding retrieval for multimodal content moderation. The approach significantly enhances moderation flexibility, interpretability, and real-time responsiveness. Experiments demonstrate strong generalization across 25 emerging violation categories, achieving ROC-AUC of 0.99 and PR-AUC of 0.95. In live production, the system increases actionable detection rate by 10.32% while reducing operational costs by over 80%.
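The summary describes integrating supervised contrastive learning (SCL) into embedding modeling. As an illustration only (the paper's encoders, batch construction, and multimodal fusion are not shown), here is a minimal NumPy sketch of the standard SupCon objective; the function name and temperature value are assumptions, not taken from the paper:

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive (SupCon) loss over one batch.

    embeddings: (N, D) array; labels: (N,) class ids.
    Positives for anchor i are the other samples sharing its label.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature                        # pairwise cosine / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # Row-wise log-softmax over all non-self pairs (max-subtraction for stability).
    sim_max = np.max(np.where(self_mask, -np.inf, sim), axis=1, keepdims=True)
    logits = sim - sim_max
    exp_logits = np.where(self_mask, 0.0, np.exp(logits))
    log_prob = logits - np.log(exp_logits.sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & ~self_mask
    # Average log-probability over each anchor's positives, then over anchors.
    has_pos = positives.sum(axis=1) > 0
    mean_log_prob = (log_prob * positives).sum(axis=1)[has_pos] / positives.sum(axis=1)[has_pos]
    return -mean_log_prob.mean()
```

Intuitively, the loss falls as same-label embeddings move together and different-label embeddings move apart, which is what makes the resulting space useful for similarity retrieval.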

📝 Abstract
Video understanding plays a fundamental role for content moderation on short video platforms, enabling the detection of inappropriate content. While classification remains the dominant approach for content moderation, it often struggles in scenarios requiring rapid and cost-efficient responses, such as trend adaptation and urgent escalations. To address this issue, we introduce an Embedding-Based Retrieval (EBR) method designed to complement traditional classification approaches. We first leverage a Supervised Contrastive Learning (SCL) framework to train a suite of foundation embedding models, including both single-modal and multi-modal architectures. Our models demonstrate superior performance over established contrastive learning methods such as CLIP and MoCo. Building on these embedding models, we design and implement the embedding-based retrieval system that integrates embedding generation and video retrieval to enable efficient and effective trend handling. Comprehensive offline experiments on 25 diverse emerging trends show that EBR improves ROC-AUC from 0.85 to 0.99 and PR-AUC from 0.35 to 0.95. Further online experiments reveal that EBR increases action rates by 10.32% and reduces operational costs by over 80%, while also enhancing interpretability and flexibility compared to classification-based solutions.
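To make the retrieval step concrete, here is a hypothetical, exact-search stand-in for the ANN index the abstract describes: embeddings of known violating ("seed") videos are stored normalized, and a query video is flagged when its nearest seed exceeds a similarity threshold. The function names and the threshold value are illustrative assumptions, not from the paper:

```python
import numpy as np

def build_index(seed_embeddings):
    """Store seed (known-violation) embeddings L2-normalized, so a dot
    product against a normalized query equals cosine similarity."""
    return seed_embeddings / np.linalg.norm(seed_embeddings, axis=1, keepdims=True)

def moderate(index, query_embedding, threshold=0.8, k=5):
    """Flag a video when its most similar seed violation scores >= threshold.

    Returns (flagged, top_k_similarities). Exact search here; a production
    system would use an ANN index over the same vectors for scalability.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    sims = index @ q                       # cosine similarity to every seed
    top_k = np.sort(sims)[-k:][::-1]       # highest similarities first
    return bool(top_k[0] >= threshold), top_k
```

This design is what gives retrieval its claimed flexibility: adapting to a new violation trend only requires adding a few seed embeddings to the index, not retraining a classifier.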
Problem

Research questions and friction points this paper addresses.

- Enhancing the efficiency and adaptability of video content moderation
- Overcoming the limitations of classification-based approaches in rapid trend response
- Reducing operational costs while improving detection performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Supervised Contrastive Learning for training unimodal and multimodal embedding models
- Embedding-Based Retrieval system for video moderation
- Multi-modal foundation models that outperform CLIP and MoCo
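The ROC-AUC figures reported above are threshold-free ranking metrics. As background only (this is not the paper's evaluation code), a rank-based ROC-AUC can be computed directly from retrieval similarity scores; this sketch assumes no tied scores:

```python
import numpy as np

def roc_auc(scores, labels):
    """Rank-based ROC-AUC: the probability that a randomly chosen positive
    outscores a randomly chosen negative. Assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # Mann-Whitney U statistic normalized by the number of pos/neg pairs.
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```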
Authors

Hanzhong Liang
TikTok, San Jose, CA, USA

Jinghao Shi
Carnegie Mellon University (Computer Vision, Machine Learning)

Xiang Shen
TikTok, Bellevue, WA, USA

Zixuan Wang
TikTok, San Jose, CA, USA

Vera Wen
TikTok, San Jose, CA, USA

Ardalan Mehrani
TikTok, San Jose, CA, USA

Zhiqian Chen
TikTok, San Jose, CA, USA

Yifan Wu
TikTok, San Jose, CA, USA

Zhixin Zhang
Ph.D. in Robotics, University of Manchester (SLAM, VINS, LIO, Sensor Fusion, Robotics)