OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing harmful prompt detection methods exhibit poor generalization across languages and modalities (text, image, audio). To address this, we propose the first language- and modality-agnostic unified safety classifier based on internal representation alignment. Our method leverages cross-modal aligned hidden-layer embeddings from LLMs and multimodal LLMs (MLLMs), employing a lightweight classification head and an embedding reuse mechanism during generation for efficient, real-time moderation. Experiments show improvements of +11.57% in multilingual text detection accuracy, +20.44% in image prompt detection, and establish a new state-of-the-art for audio prompt detection. Inference is approximately 120× faster than the best-performing baseline. The core innovation lies in decoupling modality-specific input processing from safety discrimination, enabling, for the first time, unified, representation-driven joint modeling of harmfulness across multiple modalities and languages.

📝 Abstract
The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (≈120× faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
Problem

Research questions and friction points this paper is trying to address.

Detecting harmful prompts across languages and modalities
Improving classification accuracy for multilingual and multimodal inputs
Enhancing efficiency in AI safety moderation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies internal LLM/MLLM representations that are aligned across languages and modalities
Trains a lightweight, language- and modality-agnostic classifier on these aligned representations
Repurposes embeddings already computed during generation, making moderation efficient (≈120× faster than the next fastest baseline)
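The pipeline described above, a lightweight classification head applied to a hidden-state embedding that the model already computed during generation, can be sketched as follows. This is an illustrative sketch, not the authors' code: the hidden dimension, head parameters, and the random stand-in embedding are all hypothetical; in OMNIGUARD the input would be a cached hidden state from a language/modality-aligned layer of an LLM/MLLM.

```python
import math
import random

# Minimal sketch (illustrative only): a linear "safety head" over a
# hidden-state embedding from an aligned layer of an LLM/MLLM. The
# embedding below is a random stand-in; in practice it is reused from
# generation, so classification adds almost no extra inference cost.

random.seed(0)
HIDDEN_DIM = 16  # hypothetical hidden size

# Hypothetical trained parameters of the classification head.
w = [random.gauss(0, 1) for _ in range(HIDDEN_DIM)]
b = 0.0

def harmfulness_score(hidden_state):
    """sigmoid(w . h + b): estimated probability the prompt is harmful."""
    z = sum(wi * hi for wi, hi in zip(w, hidden_state)) + b
    return 1.0 / (1.0 + math.exp(-z))

def is_harmful(hidden_state, threshold=0.5):
    """Flag the prompt when the score crosses a decision threshold."""
    return harmfulness_score(hidden_state) >= threshold

# Stand-in for an aligned-layer embedding cached during generation.
h = [random.gauss(0, 1) for _ in range(HIDDEN_DIM)]
score = harmfulness_score(h)
print(f"harmfulness score: {score:.3f}, flagged: {is_harmful(h)}")
```

Because the head is a single linear layer over an embedding the model has already produced, the classifier itself is language- and modality-agnostic: any input whose representation lands in the shared aligned space can be scored the same way.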