When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the detectability bottleneck of AI-generated social media text under realistic threat scenarios—specifically, when adversaries employ privately fine-tuned large language models (LLMs) to produce short, informal posts without disclosing model identities. Method: The authors construct a large-scale, diverse dataset comprising over 500,000 AI-generated tweets spanning multiple LLMs (Llama, GPT, Claude) and custom fine-tuned variants across varied topics, and systematically benchmark state-of-the-art detectors—including DetectGPT and RoBERTa-Detector—alongside human evaluators. Contribution/Results: Results reveal, for the first time, that private fine-tuning severely degrades detector robustness, causing average F1 drops exceeding 40%; human accuracy falls to only 52.3%, near chance level. The work fundamentally challenges the “model-known” assumption underlying current detection paradigms, establishing a critical empirical benchmark and methodological warning for AI content governance.

📝 Abstract
Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.
Problem

Research questions and friction points this paper is trying to address.

Detecting AI-generated social media text is challenging due to short, informal content.
Fine-tuned AI models evade detection when attackers withhold their models.
Current detection methods fail against sophisticated AI-generated influence campaigns.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned LLMs generate human-like social media text.
Dataset of 505,159 AI-generated posts from open-source, closed-source, and fine-tuned models.
Detection performance drops sharply when the fine-tuned generating model is withheld.
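The headline result is a drop in detector F1 when the generator is a privately fine-tuned model. As a minimal sketch of how such a drop would be measured (the labels and predictions below are hypothetical, not from the paper's dataset):

```python
# Illustrative sketch: measuring a detector's F1 drop when posts come from a
# withheld fine-tuned model. Labels: 1 = AI-generated, 0 = human-written.

def f1_score(y_true, y_pred):
    """F1 for the positive (AI-generated) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical detector predictions on the same labeled posts, before and
# after the generator is swapped for a privately fine-tuned model.
labels          = [1, 1, 1, 1, 0, 0, 0, 0]
preds_known     = [1, 1, 1, 0, 0, 0, 0, 1]  # generating model is known
preds_finetuned = [1, 0, 0, 0, 0, 1, 0, 1]  # posts from a withheld fine-tune

f1_known = f1_score(labels, preds_known)        # 0.75
f1_ft = f1_score(labels, preds_finetuned)       # ~0.29
relative_drop = (f1_known - f1_ft) / f1_known   # >40% relative drop
```

The paper's reported 40%+ average F1 drop refers to this kind of before/after comparison aggregated over detectors and topics; the toy numbers here merely show the arithmetic.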