Preprint: Did I Just Browse A Website Written by LLMs?

📅 2025-07-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Current detectors for LLM-generated text are unreliable on real-world websites. They are built for clean, unstructured prose and cope poorly with diverse HTML markup and mixed genres, which makes undisclosed LLM content even harder to identify.

Method: We propose the first website-level detection framework. It aggregates page-level predictions via voting and ensemble strategies to classify entire domains, introduces robust preprocessing to handle noisy, multimodal web text, and incorporates domain-aware features to mitigate markup-induced artifacts.

Contribution/Results: We curate two real-world, human-annotated website datasets (120 sites in total), on which the detector reaches 100% accuracy in domain-level classification. Applied at scale, the pipeline flags 1,247 LLM-dominant sites among 10,000 search engine results and 893 among 10,000 Common Crawl domains, indicating that such sites are both prevalent and becoming more common. The method is reported to be reliable, robust to adversarial markup and stylistic variation, and linearly scalable, which makes it a candidate basis for web provenance auditing and content governance.

📝 Abstract

Increasingly, web content is automatically generated by large language models (LLMs) with little human input. We call this "LLM-dominant" content. Since LLMs plagiarize and hallucinate, LLM-dominant content can be unreliable and unethical. Yet, websites rarely disclose such content, and human readers struggle to distinguish it. Thus, we must develop reliable detectors for LLM-dominant content. However, state-of-the-art LLM detectors are insufficient, because they perform well mainly on clean, prose-like text, while web content has complex markup and diverse genres. We propose a highly reliable, scalable pipeline that classifies entire websites. Instead of naively classifying text extracted from each page, we classify each site based on an LLM text detector's outputs of multiple prose-like pages. We train and evaluate our detector by collecting 2 distinct ground truth datasets totaling 120 sites, and obtain 100% accuracies testing across them. In the wild, we detect a sizable portion of sites as LLM-dominant among 10k sites in search engine results and 10k in Common Crawl archives. We find LLM-dominant sites are growing in prevalence and rank highly in search results, raising questions about their impact on end users and the overall Web ecosystem.
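The core idea in the abstract, classifying a site from a detector's outputs on several of its pages rather than judging each page alone, can be illustrated with a small sketch. This is not the paper's classifier: the abstract does not specify how page-level outputs are combined, so the majority-style rule below, the function name `classify_site`, and the parameters `page_threshold`, `site_threshold`, and `min_pages` are all assumptions chosen for illustration. The page scores are assumed to be probabilities in [0, 1] produced by some external LLM text detector.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class SiteVerdict:
    n_pages: int             # number of prose-like pages that were scored
    mean_score: float        # average detector score across those pages
    flagged_fraction: float  # share of pages at or above page_threshold
    llm_dominant: bool       # site-level decision


def classify_site(
    page_scores: Sequence[float],
    page_threshold: float = 0.5,  # hypothetical per-page cutoff
    site_threshold: float = 0.5,  # hypothetical share of flagged pages needed
    min_pages: int = 3,           # hypothetical minimum evidence per site
) -> Optional[SiteVerdict]:
    """Aggregate per-page detector scores into one site-level verdict.

    Returns None when there are too few pages to decide, so that a site
    is never labeled on the strength of a single page.
    """
    if len(page_scores) < min_pages:
        return None
    flagged = sum(score >= page_threshold for score in page_scores)
    fraction = flagged / len(page_scores)
    return SiteVerdict(
        n_pages=len(page_scores),
        mean_score=mean(page_scores),
        flagged_fraction=fraction,
        llm_dominant=fraction >= site_threshold,
    )


if __name__ == "__main__":
    # Made-up scores for four pages of one site.
    print(classify_site([0.9, 0.8, 0.2, 0.95]))
```

Aggregating over several pages means that one misjudged page (the 0.2 above) does not flip the site-level label, which is the robustness argument the abstract makes against naive per-page classification.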

Problem

Research questions and friction points this paper is trying to address.

Detecting LLM-generated web content reliably

Addressing limitations of current LLM detectors on complex web content

Assessing prevalence and impact of LLM-dominant websites

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipeline classifies websites using LLM text detectors

Uses multiple prose-like pages for accurate detection

Achieves 100% accuracy when tested across two ground truth datasets totaling 120 sites
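Because text detectors work best on clean prose, the pipeline restricts itself to prose-like pages before scoring. The summary does not say how such pages are selected, so the following is only a rough heuristic sketch using Python's standard `html.parser`: a page counts as prose-like when it has enough visible words and most of them sit inside `<p>` elements. The function name `is_prose_like` and the thresholds `min_words` and `min_paragraph_ratio` are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Counts visible words and how many of them appear inside <p> tags."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self.all_words = 0
        self.paragraph_words = 0
        self._in_paragraph = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "p":
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script/style content
        n = len(data.split())
        self.all_words += n
        if self._in_paragraph:
            self.paragraph_words += n


def is_prose_like(
    html: str,
    min_words: int = 150,              # hypothetical minimum page length
    min_paragraph_ratio: float = 0.6,  # hypothetical share of words in <p>
) -> bool:
    """Heuristic: enough text, and most of it in paragraph elements."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    if collector.all_words < min_words:
        return False
    return collector.paragraph_words / collector.all_words >= min_paragraph_ratio
```

A filter like this would drop navigation-heavy index pages and short stubs, leaving article-style pages for the detector. Real-world HTML often omits closing `</p>` tags, so a production version would need a more tolerant parser than this sketch.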
