🤖 AI Summary
Content moderation faces accuracy bottlenecks due to label sparsity, dynamically evolving policies, and the need for deep reasoning beyond shallow pattern matching. Method: We propose the first RL-LLM co-training framework for real-time, large-scale (hundreds of millions of users) AIGC compliance auditing. We empirically discover an S-shaped scaling law of RL on moderation tasks; design a verifiable reward mechanism and LLM-as-judge reward modeling; and integrate multi-stage rollout training with a policy-distillation classifier architecture to jointly optimize policy alignment and data efficiency. Results: On three real-world moderation tasks, our approach significantly improves complex policy-reasoning accuracy, achieves up to 100× higher data efficiency than supervised fine-tuning, and exhibits smooth, saturating performance gains as training data, rollouts, and optimization steps increase, establishing a scalable, verifiable paradigm for expert-level auditing under dynamic regulatory policies.
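The S-shaped scaling law described above can be sketched as a logistic curve in the log of training-set size: accuracy rises slowly at first, climbs rapidly through a middle regime, then saturates. The specific parameter values below (floor, ceiling, steepness, midpoint) are illustrative assumptions, not numbers from the paper:

```python
import math

def sigmoid_scaling(n, lo=0.55, hi=0.92, k=1.4, midpoint=math.log(2000)):
    """Hypothetical S-shaped accuracy curve in training-set size n:
    slow early gains, a steep middle regime, then saturation near `hi`.
    `lo`/`hi` are the assumed accuracy floor and ceiling; `k` controls
    steepness; `midpoint` is the log-scale inflection point."""
    return lo + (hi - lo) / (1.0 + math.exp(-k * (math.log(n) - midpoint)))

# Accuracy improves smoothly with data, then flattens toward the ceiling
for n in (100, 2_000, 100_000):
    print(n, round(sigmoid_scaling(n), 3))
```

At `n = 2000` (the assumed inflection point) the curve sits exactly halfway between floor and ceiling; fitting such a curve to observed accuracies is one way to estimate where further data stops paying off.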
📝 Abstract
Content moderation at scale remains one of the most pressing challenges in today's digital ecosystem, where billions of user- and AI-generated artifacts must be continuously evaluated for policy violations. Although recent advances in large language models (LLMs) have demonstrated strong potential for policy-grounded moderation, the practical challenges of training these systems to expert-level accuracy in real-world settings remain largely unexplored, particularly in regimes characterized by label sparsity, evolving policy definitions, and the need for nuanced reasoning beyond shallow pattern matching. In this work, we present a comprehensive empirical investigation of scaling reinforcement learning (RL) for content classification, systematically evaluating multiple RL training recipes and reward-shaping strategies, including verifiable rewards and LLM-as-judge frameworks, to transform general-purpose language models into specialized, policy-aligned classifiers across three real-world content moderation tasks. Our findings provide actionable insights for industrial-scale moderation systems: RL exhibits sigmoid-like scaling behavior, in which performance improves smoothly with increased training data, rollouts, and optimization steps before gradually saturating. Moreover, we show that RL substantially improves performance on tasks requiring complex policy-grounded reasoning while achieving up to 100× higher data efficiency than supervised fine-tuning, making it particularly effective in domains where expert annotations are scarce or costly.
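A verifiable reward for classification-style RL can be sketched as an exact-match check on a structured rollout: the reward is 1.0 only when the model emits a well-formed verdict that matches the ground-truth label. The tag format and the three-way label set here are illustrative assumptions, not the paper's actual output schema:

```python
import re

ALLOWED_LABELS = {"allow", "flag", "block"}  # hypothetical policy label set

def verifiable_reward(response: str, gold_label: str) -> float:
    """Return 1.0 iff the rollout contains a parsable <verdict> tag whose
    label is in the allowed set and matches the ground truth; else 0.0."""
    match = re.search(r"<verdict>(.*?)</verdict>", response, re.DOTALL)
    if match is None:
        return 0.0  # malformed output: no parsable verdict tag
    label = match.group(1).strip().lower()
    if label not in ALLOWED_LABELS:
        return 0.0  # verdict outside the policy label set
    return 1.0 if label == gold_label else 0.0

# A well-formed, correct rollout earns full reward
print(verifiable_reward("The post targets a group...<verdict>block</verdict>", "block"))  # 1.0
```

Because the reward is computed mechanically from annotated labels rather than from a learned scorer, it cannot be gamed by fluent-but-wrong rationales; an LLM-as-judge reward would replace the exact-match branch with a grader model's score for tasks where a single gold label is too coarse.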