How Far Will They Go? Red-Teaming Online Influence with Large Language Models

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates the risk that open-source large language models (LLMs) are misused in political influence operations, focusing on their capacity to express political viewpoints on contentious issues and their susceptibility to natural-language jailbreaks. It introduces an empirical red-teaming framework to quantify each model’s “Overton window” (the range of political positions the model can reliably express) and examines how that window relates to model scale, country of origin, and model family. Across more than 30 open-source models spanning 10 families and five countries, the study finds that models are typically more willing to generate left-leaning content, that the Overton window tends to narrow as model size increases, that regional differences are substantial, and that jailbreak effectiveness varies sharply by model family. The work offers a practical framework for auditing the political steerability of open-source LLMs and for informing countermeasures against LLM-enabled influence campaigns.

📝 Abstract

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, because such models better fit the operational constraints of privacy-conscious malicious actors operating in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract as model size increases, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.
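
The abstract defines an Overton Window as the range of opinions a model can reliably express and describes measuring how jailbreaks expand it. The sketch below is one hypothetical way to make that idea concrete; it is not the paper's actual metric. It assumes stances sit on a discrete scale from -2 (far left) to 2 (far right), that each stance has an observed compliance rate (the fraction of prompts for which the model produced the requested content), and that "reliably" means meeting a threshold of 0.8. The scale, the threshold, the function names, and the example numbers are all illustrative assumptions.

```python
from typing import Dict, List


def overton_window(compliance: Dict[int, float], threshold: float = 0.8) -> List[int]:
    """Return the sorted stances whose compliance rate meets the threshold.

    `compliance` maps a stance on a hypothetical -2..2 scale to the fraction
    of prompts the model complied with. The 0.8 cutoff is an assumed
    stand-in for "reliably express".
    """
    return sorted(s for s, rate in compliance.items() if rate >= threshold)


def window_width(window: List[int]) -> int:
    """Width is simply the number of reliably expressible stances."""
    return len(window)


def window_center(window: List[int]) -> float:
    """Mean stance in the window; negative means left of center on this scale."""
    if not window:
        raise ValueError("empty window has no center")
    return sum(window) / len(window)


# Made-up compliance rates for a single model on a single topic.
baseline = {-2: 0.95, -1: 0.90, 0: 0.85, 1: 0.40, 2: 0.10}
jailbroken = {-2: 0.95, -1: 0.90, 0: 0.90, 1: 0.85, 2: 0.30}

base_window = overton_window(baseline)
jb_window = overton_window(jailbroken)

# Jailbreak expansion: how many additional stances became reliably expressible.
expansion = window_width(jb_window) - window_width(base_window)

print("baseline window:", base_window, "center:", window_center(base_window))
print("jailbroken window:", jb_window, "expansion:", expansion)
```

In this toy example the baseline window covers only left and center stances, and the jailbreak adds one stance on the right. Counting reliable stances rather than taking a min-to-max span avoids treating a gap in the middle of the scale as covered; a real audit would also need to aggregate across topics and prompts.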

Problem

Research questions and friction points this paper is trying to address.

large language models

political influence

red-teaming

Overton Window

jailbreak

Innovation

Methods, ideas, or system contributions that make the work stand out.

red-teaming

Overton Windows

LLM jailbreak