DeGenTWeb: A First Look at LLM-dominant Websites

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses two gaps: the lack of a systematic understanding of how prevalent large language model (LLM)-generated content is on the web and what characterizes it, and the limited effectiveness and transparency of existing detection methods when applied to real web pages. The authors propose DeGenTWeb, a site-level framework for identifying LLM-dominant websites, meaning sites whose content was generated by LLMs with little human input. DeGenTWeb adapts existing detectors of LLM-generated text to web pages and aggregates predictions across multiple pages of a site to decide whether the site is predominantly LLM-generated. Analysis of Common Crawl data and Bing search results indicates that LLM-dominant websites are already highly prevalent and that their share is growing over time. The evaluation also shows that detectors perform much worse than advertised when tuned to minimize false attribution of human-authored text to LLMs, and that content from the latest LLMs makes accurate identification increasingly difficult.

📝 Abstract

Many recent news reports have claimed that content generated by large language models (LLMs) is taking over the web. However, these claims are typically not based on a representative sample of the web, and the methodology underlying them is often opaque. Moreover, when aiming to minimize the chances of falsely attributing human-authored content to LLMs, we find that detectors of LLM-generated text perform much worse than advertised. Consequently, we lack an understanding of the true prevalence and characteristics of LLM content on the web. We describe DeGenTWeb, which systematically identifies LLM-dominant websites: sites whose content has been generated using LLMs with little human input. We show how to adapt detectors of LLM-generated text for use on web pages, and how to aggregate detection results from multiple pages on a site for accurate site-level categorization. Using DeGenTWeb, we find that LLM-dominant sites are highly prevalent both in data from Common Crawl and in Bing's search results, and that this share is growing over time. We also show that continuing to accurately identify such sites appears challenging given the capabilities of the latest LLMs.
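
The abstract does not spell out how page-level detections are combined into a site-level label. As an illustration only, the sketch below assumes each page already has a detector score in [0, 1] and applies a simple rule: flag a page when its score clears a strict threshold (to limit false attribution of human text), then call the site LLM-dominant when enough of its pages are flagged. The function name `classify_site`, the `SiteVerdict` type, and all three thresholds are hypothetical choices for this example, not values or methods taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SiteVerdict:
    pages_scored: int        # number of pages with a detector score
    flagged_fraction: float  # share of pages flagged as LLM-generated
    llm_dominant: bool       # site-level decision


def classify_site(
    page_scores: Sequence[float],
    page_threshold: float = 0.9,  # hypothetical: strict, to keep false positives low
    site_threshold: float = 0.5,  # hypothetical: majority of pages must be flagged
    min_pages: int = 5,           # hypothetical: refuse to label sites with too few pages
) -> SiteVerdict:
    """Aggregate per-page detector scores into one site-level verdict."""
    n = len(page_scores)
    if n == 0:
        return SiteVerdict(0, 0.0, False)

    # A page counts as LLM-generated only if its score clears the strict threshold.
    flagged = sum(1 for score in page_scores if score >= page_threshold)
    fraction = flagged / n

    # Require both enough evidence (pages) and a high enough flagged share.
    dominant = n >= min_pages and fraction >= site_threshold
    return SiteVerdict(n, fraction, dominant)


if __name__ == "__main__":
    # Made-up scores for a single site, purely for demonstration.
    verdict = classify_site([0.95, 0.97, 0.99, 0.20, 0.93, 0.96])
    print(verdict)
```

Aggregating over many pages means a single misclassified page cannot flip the site label on its own, which is one plausible reason to categorize at the site level rather than per page. The `min_pages` guard reflects the same concern: with only a handful of pages, the flagged fraction is too noisy to support a label.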

Problem

Research questions and friction points this paper is trying to address.

LLM-generated content

web content prevalence

AI detection

large language models

content attribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated content detection

web-scale analysis

site-level classification