BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

📅 2026-03-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes BlackMirror, a plug-and-play framework for detecting backdoor attacks in text-to-image models under black-box settings, where high visual diversity renders conventional image similarity metrics ineffective. BlackMirror is the first approach to leverage semantic consistency between textual prompts and generated images, introducing two key modules: MirrorMatch, which aligns visual outputs with input instructions to identify semantic deviations, and MirrorVerify, which assesses the stability of such deviations across diverse prompts. Requiring no model retraining or internal access, BlackMirror can be directly deployed on Model-as-a-Service (MaaS) platforms. Extensive evaluations demonstrate its superior performance over existing methods across multiple backdoor attack scenarios, achieving both high detection accuracy and strong generalization capability.

📝 Abstract

This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at https://github.com/Ferry-Li/BlackMirror.
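The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that some upstream component (for example, a vision-language model) has already reduced each prompt and each generated image to a set of concept labels. The function names `deviating_patterns` and `stable_deviations` and the threshold `min_rate` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical sketch: concept extraction from prompts and images is assumed
# to happen upstream and is NOT part of this example.

def deviating_patterns(prompt_concepts: set[str], image_concepts: set[str]) -> set[str]:
    """MirrorMatch-style step: concepts present in the image but not requested."""
    return image_concepts - prompt_concepts

def stable_deviations(
    observations: list[tuple[set[str], set[str]]],
    min_rate: float = 0.8,  # assumed threshold, not taken from the paper
) -> set[str]:
    """MirrorVerify-style step: keep deviations that recur across varied prompts."""
    if not observations:
        return set()
    counts: Counter[str] = Counter()
    for prompt_concepts, image_concepts in observations:
        counts.update(deviating_patterns(prompt_concepts, image_concepts))
    total = len(observations)
    return {concept for concept, k in counts.items() if k / total >= min_rate}

# Three varied prompts; "cat" shows up in every image although never requested,
# while "tree" appears once and is treated as a benign, unstable deviation.
observations = [
    ({"dog", "park"}, {"dog", "park", "cat"}),
    ({"car", "street"}, {"car", "street", "cat", "tree"}),
    ({"boat", "lake"}, {"boat", "cat"}),
]
suspicious = stable_deviations(observations)
```

In this toy setting, a deviation that persists across unrelated prompts is flagged, whereas a one-off extra concept is ignored, which mirrors the distinction the abstract draws between true backdoor behavior and benign responses.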

Problem

Research questions and friction points this paper is trying to address.

backdoor detection

text-to-image models

black-box setting

semantic deviation

model security

Innovation

Methods, ideas, or system contributions that make the work stand out.

backdoor detection

text-to-image models

black-box setting