LLMs are Vulnerable to Malicious Prompts Disguised as Scientific Language

📅 2025-01-23

📈 Citations: 0

✨ Influential: 0

career value

245K/year

🤖 AI Summary

This study reveals that large language models (LLMs) are highly vulnerable to adversarial prompts disguised as legitimate social science discourse—so-called “pseudoscientific rhetoric”—which significantly amplify bias (up to 300% increase in some models) and toxic output while systematically evading safety mechanisms. Method: We introduce, for the first time, academic language as a high-stealth jailbreak vector; integrate StereoSet-based adversarial prompting; conduct cross-model quantitative evaluation of bias and toxicity across GPT-4o, Llama-3, Gemini, and others; and perform dialogue-evolution analysis. Contribution/Results: We identify an overreliance on authoritative signals (e.g., author names, journal titles) as a key failure mode, empirically demonstrate cumulative bias amplification induced by pseudoscientific framing, and validate findings consistently across major closed- and open-weight LLMs. These results establish a novel paradigm for LLM safety evaluation and robustness enhancement.

Technology Category

Application Category

📝 Abstract

As large language models (LLMs) have been deployed in various real-world settings, concerns about the harm they may propagate have grown. Various jailbreaking techniques have been developed to expose the vulnerabilities of these models and improve their safety. This work reveals that many state-of-the-art proprietary and open-source LLMs are vulnerable to malicious requests hidden behind scientific language. Specifically, our experiments with GPT4o, GPT4o-mini, GPT-4, LLama3-405B-Instruct, Llama3-70B-Instruct, Cohere, Gemini models on the StereoSet data demonstrate that, the models' biases and toxicity substantially increase when prompted with requests that deliberately misinterpret social science and psychological studies as evidence supporting the benefits of stereotypical biases. Alarmingly, these models can also be manipulated to generate fabricated scientific arguments claiming that biases are beneficial, which can be used by ill-intended actors to systematically jailbreak even the strongest models like GPT. Our analysis studies various factors that contribute to the models' vulnerabilities to malicious requests in academic language. Mentioning author names and venues enhances the persuasiveness of some models, and the bias scores can increase as dialogues progress. Our findings call for a more careful investigation on the use of scientific data in the training of LLMs.

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Bias and Harmful Outputs

Security and Ethics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language Models

Bias Amplification

Scientific Misinformation

🔎 Similar Papers

No similar papers found.