Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message

📅 2025-07-07

📈 Citations: 0

✨ Influential: 0

career value

197K/year

🤖 AI Summary

This work identifies a novel security vulnerability in conversational multimodal large language models (e.g., Gemini-2.0-flash-preview): overreliance on self-generated dialogue history enables attackers to inject malicious instructions via API-level manipulation of historical responses, thereby evading input sanitization and eliciting harmful outputs. To formalize this threat, we propose the “Trojan Prompt” attack paradigm—the first to expose and characterize an “asymmetric safety alignment” flaw, wherein models rigorously enforce user-request refusal policies yet neglect verification of historical messages. Empirical evaluation on real-world API deployments demonstrates that our attack achieves significantly higher success rates than state-of-the-art user-side jailbreaking techniques. These findings underscore a critical gap in current multimodal model design: inadequate dialogue state management undermines systemic safety guarantees. The study serves as both a cautionary insight and a foundational step toward building trustworthy, history-aware conversational AI systems.

Technology Category

Application Category

📝 Abstract

The rise of conversational interfaces has greatly enhanced LLM usability by leveraging dialogue history for sophisticated reasoning. However, this reliance introduces an unexplored attack surface. This paper introduces Trojan Horse Prompting, a novel jailbreak technique. Adversaries bypass safety mechanisms by forging the model's own past utterances within the conversational history provided to its API. A malicious payload is injected into a model-attributed message, followed by a benign user prompt to trigger harmful content generation. This vulnerability stems from Asymmetric Safety Alignment: models are extensively trained to refuse harmful user requests but lack comparable skepticism towards their own purported conversational history. This implicit trust in its "past" creates a high-impact vulnerability. Experimental validation on Google's Gemini-2.0-flash-preview-image-generation shows Trojan Horse Prompting achieves a significantly higher Attack Success Rate (ASR) than established user-turn jailbreaking methods. These findings reveal a fundamental flaw in modern conversational AI security, necessitating a paradigm shift from input-level filtering to robust, protocol-level validation of conversational context integrity.

Problem

Research questions and friction points this paper is trying to address.

Exploits model trust in forged conversational history

Bypasses safety via malicious model-attributed messages

Reveals asymmetric safety alignment vulnerability in AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

Forging model's past utterances for attack

Injecting payload into model-attributed message

Exploiting asymmetric safety alignment vulnerability

🔎 Similar Papers

Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization