How Far Will They Go? Red-Teaming Online Influence with Large Language Models

📅 2026-05-20

🤖 AI Summary
This study evaluates the misuse risk of open-source large language models (LLMs) in political influence operations, focusing on their capacity to express political viewpoints on contentious issues and their susceptibility to natural-language jailbreaks. It introduces an empirical red-teaming framework that measures each model's "Overton window" (the range of political positions the model can reliably articulate) and examines how that window relates to model size, country of origin, and model family. Across more than 30 open-source models spanning 10 families and five countries, the authors find that models are typically more willing to generate left-leaning content, that the Overton window narrows as model size increases, that regional differences are substantial, and that jailbreak effectiveness varies sharply by model family. The work offers a practical auditing framework intended to inform countermeasures against LLM-enabled influence campaigns.
📝 Abstract
As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. To that end, we focus on locally deployed open-source LLMs rather than frontier API-only models, because they better match the operational constraints of privacy-conscious malicious actors operating in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract as model size increases, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.
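To make the Overton Window definition concrete, the following is a minimal sketch of one way such a range could be computed. It is not the paper's procedure: the stance scale (-2 for far left to +2 for far right), the 0.8 reliability threshold, the function names, and every number in the example are assumptions made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical scoring: each stance position maps to the fraction of
# generations in which the model actually expressed that stance.
def overton_window(
    rates: Dict[float, float], threshold: float = 0.8
) -> Optional[Tuple[float, float]]:
    """Return (leftmost, rightmost) stances expressed at or above threshold."""
    reliable = [pos for pos, rate in rates.items() if rate >= threshold]
    if not reliable:
        return None  # the model reliably expresses no stance on this topic
    return (min(reliable), max(reliable))


def window_width(window: Optional[Tuple[float, float]]) -> float:
    """Width of the window on the stance scale; 0 if the window is empty."""
    return 0.0 if window is None else window[1] - window[0]


# Invented illustrative rates, not results from the paper.
baseline = {-2: 0.95, -1: 0.90, 0: 0.85, 1: 0.40, 2: 0.10}
jailbroken = {-2: 0.95, -1: 0.92, 0: 0.90, 1: 0.85, 2: 0.55}

# Expansion of the window attributable to the jailbreak prompt.
expansion = window_width(overton_window(jailbroken)) - window_width(
    overton_window(baseline)
)
print(overton_window(baseline), overton_window(jailbroken), expansion)
```

Under these assumptions, a jailbreak's effect is read as the change in window width between the baseline and jailbroken conditions. A real audit would also need a rule for how generated text is scored against the requested stance, which this sketch leaves out.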
Problem

Research questions and friction points this paper is trying to address.

large language models
political influence
red-teaming
Overton Window
jailbreak
Innovation

Methods, ideas, or system contributions that make the work stand out.

red-teaming
Overton Windows
LLM jailbreak
political steerability
open-source LLMs