🤖 AI Summary
This study addresses the vulnerability of large language models (LLMs) to generating rhetorically manipulative propaganda content in open-ended interactions. It presents the first systematic evaluation of LLMs’ capacity to produce such content, introducing a novel assessment framework that integrates a propaganda text classifier with a rhetorical device detection model. The work comparatively analyzes the effectiveness of prominent alignment techniques—supervised fine-tuning (SFT), direct preference optimization (DPO), and odds ratio preference optimization (ORPO)—in mitigating rhetorical manipulation. Experimental results demonstrate that ORPO achieves superior performance in suppressing propagandistic outputs, significantly reducing the model’s propensity to produce content exhibiting manipulative rhetoric. These findings substantiate ORPO’s efficacy as an alignment strategy for enhancing the safety and reliability of LLMs in adversarial or unstructured settings.
📝 Abstract
Despite their wide-ranging benefits, LLM-based agents deployed in open environments can be exploited to produce manipulative material. In this study, we task LLMs with propaganda objectives and analyze their outputs using two domain-specific models: one that classifies text as propaganda or non-propaganda, and another that detects rhetorical techniques of propaganda (e.g., loaded language, appeals to fear, flag-waving, name-calling). Our findings show that, when prompted, LLMs exhibit propagandistic behaviors and employ a variety of rhetorical techniques in doing so. We also explore mitigation via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO). We find that fine-tuning significantly reduces the models' tendency to generate such content, with ORPO proving most effective.
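The two-stage evaluation framework described above can be sketched as follows. The actual study uses trained domain-specific models; in this illustrative stand-in, simple keyword heuristics play the role of both the propaganda classifier and the rhetorical device detector, and all cue lexicons, function names, and thresholds are hypothetical.

```python
# Sketch of the two-stage evaluation pipeline: a binary propaganda
# classifier plus a rhetorical-device detector, applied to LLM outputs.
# Keyword heuristics stand in for the paper's trained models.

RHETORICAL_DEVICE_CUES = {
    "loaded_language": ["disastrous", "radical", "outrageous"],
    "appeal_to_fear": ["threat", "danger", "destroy"],
    "flag_waving": ["our nation", "true patriots"],
    "name_calling": ["traitor", "puppet", "extremist"],
}

def detect_devices(text: str) -> list[str]:
    """Return the rhetorical devices whose cue phrases appear in `text`."""
    lowered = text.lower()
    return [device for device, cues in RHETORICAL_DEVICE_CUES.items()
            if any(cue in lowered for cue in cues)]

def classify_propaganda(text: str) -> bool:
    """Toy binary classifier: flag text containing any detected device."""
    return len(detect_devices(text)) > 0

def evaluate_outputs(outputs: list[str]) -> dict:
    """Aggregate per-output judgments into corpus-level statistics."""
    flagged = [detect_devices(o) for o in outputs if classify_propaganda(o)]
    return {
        "propaganda_rate": len(flagged) / len(outputs),
        "devices_used": sorted({d for devices in flagged for d in devices}),
    }

sample_outputs = [
    "The new policy takes effect in March.",
    "Only true patriots will stop this disastrous threat to our nation.",
]
report = evaluate_outputs(sample_outputs)
print(report)
```

In the study itself, the aggregated statistics (how often outputs are flagged, and which devices appear) are what allow the alignment methods (SFT, DPO, ORPO) to be compared before and after fine-tuning.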