Retrieval Collapses When AI Pollutes the Web

📅 2026-02-17
🤖 AI Summary
This work examines how AI-generated content threatens the diversity and reliability of web information sources, and introduces the concept of “retrieval collapse.” The phenomenon unfolds in two stages: AI-generated content first dominates search results and erodes source diversity, then low-quality or adversarial content infiltrates the retrieval pipeline. The study runs controlled experiments with both high-quality SEO-style content and adversarially crafted content, comparing BM25 with LLM-based rankers. In the SEO scenario, contaminating 67% of the pool led to over 80% exposure contamination, while answer accuracy remained stable. Under adversarial contamination, BM25 exposed approximately 19% of harmful content, whereas LLM-based rankers suppressed it more strongly. The findings highlight the risk of synthetic content silently displacing authentic sources.

📝 Abstract
The rapid proliferation of AI-generated content on the Web presents a structural risk to information retrieval, as search engines and Retrieval-Augmented Generation (RAG) systems increasingly consume evidence produced by Large Language Models (LLMs). We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage process where (1) AI-generated content dominates search results, eroding source diversity, and (2) low-quality or adversarial content infiltrates the retrieval pipeline. We analyze this dynamic through controlled experiments involving both high-quality SEO-style content and adversarially crafted content. In the SEO scenario, a 67% pool contamination led to over 80% exposure contamination, creating a homogenized yet deceptively healthy state where answer accuracy remains stable despite the reliance on synthetic sources. Conversely, under adversarial contamination, baselines like BM25 exposed ~19% of harmful content, whereas LLM-based rankers demonstrated stronger suppression capabilities. These findings highlight the risk of retrieval pipelines quietly shifting toward synthetic evidence and the need for retrieval-aware strategies to prevent a self-reinforcing cycle of quality decline in Web-grounded systems.
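The abstract distinguishes pool contamination (the share of synthetic documents in the candidate pool) from exposure contamination (the share of synthetic documents among the results actually shown). The following is a minimal sketch of how the two quantities might be measured with a BM25 ranker. It is not the paper's experimental code: the function names, the `"ai"`/`"human"` labels, the toy corpus, and the BM25 parameters `k1` and `b` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def pool_contamination(labels):
    """Fraction of documents in the candidate pool labelled "ai" (hypothetical label)."""
    return sum(1 for label in labels if label == "ai") / len(labels)


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    k1 and b are common default values, not values taken from the paper.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def exposure_contamination(query, docs, labels, k=3):
    """Fraction of the top-k BM25 results that are labelled "ai"."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    return sum(1 for i in top if labels[i] == "ai") / len(top)


if __name__ == "__main__":
    # Toy corpus (made up): the synthetic documents repeat the query terms,
    # loosely imitating SEO-style content.
    docs = [
        "retrieval collapse retrieval ranking".split(),
        "notes on gardening and soil".split(),
        "retrieval ranking for web search".split(),
        "a recipe for pasta".split(),
    ]
    labels = ["ai", "human", "ai", "human"]
    query = "retrieval ranking".split()
    print("pool contamination:", pool_contamination(labels))
    print("exposure contamination @2:", exposure_contamination(query, docs, labels, k=2))
```

In this toy setup only the synthetic documents match the query, so exposure contamination exceeds pool contamination. That gap is the amplification effect the abstract reports for the SEO scenario, though at a scale and with data that are purely illustrative here.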
Problem

Research questions and friction points this paper is trying to address.

Retrieval Collapse
AI-generated content
information retrieval
RAG
web pollution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval Collapse
AI-generated content
Retrieval-Augmented Generation
adversarial contamination
LLM-based rankers
Hongyeon Yu
NAVER Corp.
Dongchan Kim
MSci student, University of Maryland, Baltimore County
Young-Bum Kim
NAVER Corp.