AgentPS: Agentic Process Supervision for Multi-modal Content Quality Assurance through Multi-round QA

📅 2024-12-15

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

career value

145K/year

🤖 AI Summary

Multimodal large language models (MLLMs) exhibit limited performance on content moderation tasks requiring fine-grained logical reasoning. Method: We propose an agent-driven process supervision framework that deeply integrates structured sequential reasoning with multi-round question-answering (QA) fine-tuning—marking the first end-to-end embedding of process supervision into the multi-round QA training pipeline. The framework enables LLMs to autonomously generate high-quality supervision signals, eliminating reliance on manual annotations while preserving both reasoning fidelity and industrial-scale scalability. Results: Experiments on a private TikTok dataset demonstrate substantial improvements over state-of-the-art baselines. Notably, using LLM-generated labels alone retains over 90% of the performance gain achieved with human annotations, validating the framework’s effectiveness and practical deployability in large-scale real-world scenarios.

Technology Category

Application Category

📝 Abstract

The advanced processing and reasoning capabilities of multimodal large language models (MLLMs) have driven substantial progress in vision-language (VL) understanding tasks. However, while effective for tasks governed by straightforward logic, MLLMs often encounter challenges when reasoning over complex, interdependent logic structures. To address this limitation, we introduce extit{AgentPS}, a novel framework that integrates Agentic Process Supervision into MLLMs via multi-round question answering during fine-tuning. extit{AgentPS} demonstrates significant performance improvements over baseline MLLMs on proprietary TikTok datasets, due to its integration of process supervision and structured sequential reasoning. Furthermore, we show that replacing human-annotated labels with LLM-generated labels retains much of the performance gain, highlighting the framework's practical scalability in industrial applications. These results position extit{AgentPS} as a highly effective and efficient architecture for multimodal classification tasks. Its adaptability and scalability, especially when enhanced by automated annotation generation, make it a powerful tool for handling large-scale, real-world challenges.

Problem

Research questions and friction points this paper is trying to address.

Enhances MLLMs for complex logical reasoning tasks

Improves multimodal content moderation with agentic supervision

Reduces reliance on human annotations for scalability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Agentic Process Supervision into MLLMs

Sequentially reasons over ancillary questions

Uses MLLM-generated labels for scalability

🔎 Similar Papers

System for systematic literature review using multiple AI agents: Concept and an empirical evaluation