A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the capability of large language models (LLMs) to generate personalised disinformation tailored to demographic attributes (e.g., age, cultural background) and its efficacy in evading existing safety mechanisms, including content filters and jailbreak defenses. The authors introduce AI-TRAITS, the first multilingual, structured dataset for evaluating attribute-aware adversarial prompting, encompassing four languages, 324 distinct false narratives, and 150 distinct persona profiles. Leveraging a multilingual red-teaming framework that integrates role-based modelling, prompt engineering, and quantitative persuasiveness analysis, the study systematically assesses mainstream LLMs across intergenerational and cross-cultural settings. Results demonstrate that minimal personalisation strategies substantially increase jailbreak success rates (average +37.2%), while simultaneously improving the linguistic adaptation and perceived credibility of deceptive outputs. This work provides the first empirical evidence that personalisation constitutes a critical vulnerability in current LLM safety architectures.

📝 Abstract
The human-like proficiency of Large Language Models (LLMs) has brought concerns about their potential misuse for generating persuasive and personalised disinformation at scale. While prior work has demonstrated that LLMs can generate disinformation, specific questions around persuasiveness and personalisation (generation of disinformation tailored to specific demographic attributes) remain largely unstudied. This paper presents the first large-scale, multilingual empirical study on persona-targeted disinformation generation by LLMs. Employing a red teaming methodology, we systematically evaluate the robustness of LLM safety mechanisms to persona-targeted prompts. A key novel result is AI-TRAITS (AI-generaTed peRsonAlIsed disinformaTion dataSet), a new dataset of around 1.6 million texts generated by eight state-of-the-art LLMs. AI-TRAITS is seeded by prompts that combine 324 disinformation narratives and 150 distinct persona profiles, covering four major languages (English, Russian, Portuguese, Hindi) and key demographic dimensions (country, generation, political orientation). The resulting personalised narratives are then assessed quantitatively and compared along the dimensions of models, languages, jailbreaking rate, and personalisation attributes. Our findings demonstrate that the use of even simple personalisation strategies in the prompts significantly increases the likelihood of jailbreaks for all studied LLMs. Furthermore, personalised prompts result in altered linguistic and rhetorical patterns and amplify the persuasiveness of the LLM-generated false narratives. These insights expose critical vulnerabilities in current state-of-the-art LLMs and offer a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM safety robustness against persona-targeted disinformation prompts
Analyzing how personalization increases jailbreak likelihood across multilingual models
Investigating altered linguistic patterns in AI-generated personalized false narratives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Red teaming methodology tests LLM safety robustness
AI-TRAITS dataset contains 1.6M personalized disinformation texts
Multilingual prompts combine disinformation narratives with personas
João A. Leite
University of Sheffield, Department of Computer Science, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, United Kingdom
Arnav Arora
University of Copenhagen
Natural Language Processing, Fact Checking, Computational Social Science, AI Safety
Silvia Gargova
Big Data for Smart Society Institute (GATE), 5, James Bourchier Blvd, Sofia, 1164, Bulgaria
João Luz
University of São Paulo, Institute of Mathematics and Computer Sciences (ICMC), Av. Trabalhador São-Carlense, 400, São Carlos, 13566-590, Brazil
Gustavo Sampaio
University of São Paulo, Institute of Mathematics and Computer Sciences (ICMC), Av. Trabalhador São-Carlense, 400, São Carlos, 13566-590, Brazil
Ian Roberts
LSHTM
Clinical Trials
Carolina Scarton
Senior Lecturer in Natural Language Processing, NLP group / GATE group, University of Sheffield
Social Media Analysis, Text Simplification, Machine Translation, Natural Language Processing, Artificial Intelligence
Kalina Bontcheva
Professor of Text Analytics, University of Sheffield
Natural Language Processing