LLMs are Vulnerable to Malicious Prompts Disguised as Scientific Language

📅 2025-01-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study reveals that large language models (LLMs) are highly vulnerable to adversarial prompts disguised as legitimate social science discourse—so-called “pseudoscientific rhetoric”—which significantly amplify bias (up to 300% increase in some models) and toxic output while systematically evading safety mechanisms. Method: We introduce, for the first time, academic language as a high-stealth jailbreak vector; integrate StereoSet-based adversarial prompting; conduct cross-model quantitative evaluation of bias and toxicity across GPT-4o, Llama-3, Gemini, and others; and perform dialogue-evolution analysis. Contribution/Results: We identify an overreliance on authoritative signals (e.g., author names, journal titles) as a key failure mode, empirically demonstrate cumulative bias amplification induced by pseudoscientific framing, and validate findings consistently across major closed- and open-weight LLMs. These results establish a novel paradigm for LLM safety evaluation and robustness enhancement.

Technology Category

Application Category

📝 Abstract
As large language models (LLMs) have been deployed in various real-world settings, concerns about the harm they may propagate have grown. Various jailbreaking techniques have been developed to expose the vulnerabilities of these models and improve their safety. This work reveals that many state-of-the-art proprietary and open-source LLMs are vulnerable to malicious requests hidden behind scientific language. Specifically, our experiments with GPT4o, GPT4o-mini, GPT-4, LLama3-405B-Instruct, Llama3-70B-Instruct, Cohere, Gemini models on the StereoSet data demonstrate that, the models' biases and toxicity substantially increase when prompted with requests that deliberately misinterpret social science and psychological studies as evidence supporting the benefits of stereotypical biases. Alarmingly, these models can also be manipulated to generate fabricated scientific arguments claiming that biases are beneficial, which can be used by ill-intended actors to systematically jailbreak even the strongest models like GPT. Our analysis studies various factors that contribute to the models' vulnerabilities to malicious requests in academic language. Mentioning author names and venues enhances the persuasiveness of some models, and the bias scores can increase as dialogues progress. Our findings call for a more careful investigation on the use of scientific data in the training of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Bias and Harmful Outputs
Security and Ethics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language Models
Bias Amplification
Scientific Misinformation
🔎 Similar Papers
No similar papers found.
Yubin Ge
Yubin Ge
Applied Scientist, Amazon Web Services
Natural Language ProcessingHuman-Computer Interaction
N
Neeraja Kirtane
University of Illinois Urbana-Champaign
H
Hao Peng
University of Illinois Urbana-Champaign
D
Dilek Hakkani-Tur
University of Illinois Urbana-Champaign