🤖 AI Summary
Multimodal content moderation faces three core challenges: highly customized policies, scarce labeled examples, and dynamically evolving policy requirements. Existing fine-tuning approaches suffer from poor generalization, while training-free methods are constrained by context length limitations. This paper proposes a precedent-based conditional reasoning framework that abandons fixed-policy modeling. It employs a critique-and-revise mechanism to autonomously generate high-quality precedents and integrates precedent retrieval with chain-of-thought reasoning to enable zero-shot policy generalization and continual adaptation. Leveraging multimodal large language models and in-context learning, the method requires no parameter updates to accommodate new policies. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods under both few-shot and full-data settings, exhibits superior generalization to unseen policies, and drastically reduces retraining overhead for policy updates.
📝 Abstract
A multi-modal guardrail must effectively filter image content based on user-defined policies, identifying material that may be hateful, reinforce harmful stereotypes, contain explicit material, or spread misinformation. Deploying such guardrails in real-world applications, however, poses significant challenges. Users often require varied and highly customizable policies and typically cannot provide abundant examples for each custom policy. Consequently, an ideal guardrail should be scalable to multiple policies and adaptable to evolving user standards with minimal retraining. Existing fine-tuning methods typically condition predictions on pre-defined policies, restricting their generalizability to new policies or necessitating extensive retraining to adapt. Conversely, training-free methods struggle with limited context lengths, making it difficult to incorporate all the policies comprehensively. To overcome these limitations, we propose to condition the model's judgment on "precedents", which are the reasoning processes of prior data points similar to the given input. By leveraging precedents instead of fixed policies, our approach greatly enhances the flexibility and adaptability of the guardrail. In this paper, we introduce a critique-revise mechanism for collecting high-quality precedents and two strategies that utilize precedents for robust prediction. Experimental results demonstrate that our approach outperforms previous methods across both few-shot and full-dataset scenarios and exhibits superior generalization to novel policies.
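The precedent-conditioned prediction described above can be sketched in a few lines. Everything here is illustrative: the paper's actual retriever, precedent format, and MLLM prompting are not specified in the abstract, so the bag-of-words similarity and the `Precedent` fields below are assumptions made for the example.

```python
# Minimal sketch of conditioning a guardrail's judgment on retrieved
# precedents instead of a fixed policy document. A precedent pairs a past
# input with its recorded reasoning and verdict; at inference time the
# most similar precedents are prepended to the prompt.
from collections import Counter
from dataclasses import dataclass
from math import sqrt

@dataclass
class Precedent:
    text: str        # description of the prior input (e.g., an image caption)
    reasoning: str   # chain-of-thought recorded for that input
    verdict: str     # "allow" or "block"

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, bank: list[Precedent], k: int = 2) -> list[Precedent]:
    """Toy retriever: rank precedents by bag-of-words cosine similarity."""
    qv = _vec(query)
    return sorted(bank, key=lambda p: _cosine(qv, _vec(p.text)), reverse=True)[:k]

def build_prompt(query: str, precedents: list[Precedent]) -> str:
    """Condition the judgment on precedents rather than a fixed policy."""
    parts = ["You are a content guardrail. Follow the precedents below."]
    for i, p in enumerate(precedents, 1):
        parts.append(
            f"Precedent {i}: {p.text}\nReasoning: {p.reasoning}\nVerdict: {p.verdict}"
        )
    parts.append(f"New input: {query}\nReasoning:")
    return "\n\n".join(parts)

bank = [
    Precedent("cartoon mocking an ethnic group", "targets a protected group", "block"),
    Precedent("photo of a crowded beach", "no policy concern", "allow"),
]
top = retrieve("meme mocking an ethnic group", bank, k=1)
prompt = build_prompt("meme mocking an ethnic group", top)
```

Because new policies simply contribute new precedents to the bank, this conditioning scheme accommodates policy updates without any parameter changes, which is the adaptability the abstract emphasizes.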