🤖 AI Summary
Problem: Implicit harmful content and contextual ambiguity pose significant challenges for video moderation. Conventional models generalize poorly, while multimodal large language models (MLLMs) incur prohibitive computational overhead and, being generative by design, are ill-suited to discriminative tasks, hindering industrial deployment. Method: We propose a lightweight router-ranking cascade system: (1) we reformulate a generative MLLM into an efficient multimodal classifier via discriminative fine-tuning, and (2) we introduce a semantic-aware routing mechanism for sample-adaptive task partitioning. Results: Fine-tuned with only 2% of the labeled data, our method achieves a 66.50% F1-score improvement over baselines. In production, it increases automated moderation throughput by 41% and reduces computational cost to just 1.5% of full MLLM inference. This approach bridges accuracy and efficiency, establishing a scalable, industrially viable paradigm for multimodal content moderation.
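The generative-to-discriminative reformulation can be pictured as replacing the MLLM's token-generation head with a small classification head over pooled multimodal features, fine-tuned with a standard cross-entropy objective. A minimal sketch, assuming a generic pooled-feature backbone; the class and dimension names here are illustrative, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class DiscriminativeMLLM(nn.Module):
    """Illustrative sketch (not the paper's code): swap the MLLM's generative
    token head for a linear classification head, so the model emits class
    logits for moderation labels instead of free-form text."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                  # stands in for the MLLM encoder
        self.cls_head = nn.Linear(hidden_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        pooled = self.backbone(features)          # (batch, hidden_dim) pooled multimodal features
        return self.cls_head(pooled)              # (batch, num_classes) logits

# Discriminative fine-tuning then uses an ordinary classification loss, e.g.:
#   loss = nn.functional.cross_entropy(model(x), labels)
```

In this framing, only a small amount of labeled data is needed to train the head (and optionally lightly adapt the backbone), which is consistent with the 2% fine-tuning-data figure reported above.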
📝 Abstract
Effective content moderation is essential for video platforms to safeguard user experience and uphold community standards. While traditional video classification models effectively handle well-defined moderation tasks, they struggle with complicated scenarios such as implicit harmful content and contextual ambiguity. Multimodal large language models (MLLMs) offer a promising solution to these limitations with their superior cross-modal reasoning and contextual understanding. However, two key challenges hinder their industrial adoption. First, the high computational cost of MLLMs makes full-scale deployment impractical. Second, adapting generative models for discriminative classification remains an open research problem. In this paper, we first introduce an efficient method to transform a generative MLLM into a multimodal classifier using minimal discriminative training data. To enable industry-scale deployment, we then propose a router-ranking cascade system that integrates MLLMs with a lightweight router model. Offline experiments demonstrate that our MLLM-based approach improves F1 score by 66.50% over traditional classifiers while requiring only 2% of the fine-tuning data. Online evaluations show that our system increases automatic content moderation volume by 41%, while the cascading deployment reduces computational cost to only 1.5% of direct full-scale deployment.
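The router-ranking cascade described in the abstract can be sketched in a few lines: a cheap router scores each sample, clear-cut cases go to a lightweight classifier, and only the ambiguous residue reaches the expensive MLLM, which is how the cascade keeps MLLM compute to a small fraction of full-scale deployment. All function bodies below are illustrative stubs over precomputed fields, not the production system:

```python
def router_score(sample):
    # Stand-in for the lightweight semantic-aware router: returns an
    # ambiguity score in [0, 1]. Here we just read a precomputed field.
    return sample["ambiguity"]

def lightweight_classifier(sample):
    # Cheap traditional classifier for clear-cut cases (stubbed).
    return "harmful" if sample["explicit_flag"] else "safe"

def mllm_classifier(sample):
    # Expensive MLLM-based discriminative classifier, reserved for
    # ambiguous samples (stubbed with a precomputed label).
    return sample["mllm_label"]

def cascade_moderate(samples, threshold=0.8):
    """Route each sample adaptively: only high-ambiguity samples invoke
    the MLLM, so MLLM calls stay a small fraction of total traffic."""
    decisions, mllm_calls = [], 0
    for s in samples:
        if router_score(s) >= threshold:
            mllm_calls += 1
            decisions.append(mllm_classifier(s))
        else:
            decisions.append(lightweight_classifier(s))
    return decisions, mllm_calls
```

The `threshold` trades accuracy against cost: lowering it sends more traffic to the MLLM, raising it keeps more decisions on the cheap path.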