Retrieval Collapses When AI Pollutes the Web

📅 2026-02-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the threat that AI-generated content poses to the diversity and reliability of web information sources, introducing the concept of “retrieval collapse.” The phenomenon unfolds in two stages: high-quality AI content first dominates search results and erodes source diversity, which then opens the way for low-quality or harmful content to enter the retrieval pipeline. The study quantifies the effect through controlled experiments that inject both SEO-optimized and adversarial AI-generated content into retrieval corpora and compare BM25 with LLM-based rankers. In the SEO scenario, contaminating 67% of the pool leads to over 80% of retrieved results being AI-generated, while answer accuracy remains stable, so the shift is easy to miss. Under adversarial contamination, BM25 exposes approximately 19% harmful content, whereas LLM-based rankers suppress it more effectively. Together, the results highlight the risk of synthetic content silently displacing authentic sources.

📝 Abstract

The rapid proliferation of AI-generated content on the Web presents a structural risk to information retrieval, as search engines and Retrieval-Augmented Generation (RAG) systems increasingly consume evidence produced by Large Language Models (LLMs). We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage process in which (1) AI-generated content dominates search results, eroding source diversity, and (2) low-quality or adversarial content infiltrates the retrieval pipeline. We analyzed this dynamic through controlled experiments involving both high-quality SEO-style content and adversarially crafted content. In the SEO scenario, a 67% pool contamination led to over 80% exposure contamination, creating a homogenized yet deceptively healthy state where answer accuracy remains stable despite the reliance on synthetic sources. Conversely, under adversarial contamination, baselines like BM25 exposed ~19% of harmful content, whereas LLM-based rankers demonstrated stronger suppression capabilities. These findings highlight the risk of retrieval pipelines quietly shifting toward synthetic evidence and the need for retrieval-aware strategies to prevent a self-reinforcing cycle of quality decline in Web-grounded systems.
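
The distinction between pool contamination and exposure contamination can be made concrete with a minimal sketch. The abstract does not give the exact metric definitions, so the code below assumes one plausible reading: pool contamination is the share of AI-generated documents in the corpus, and exposure contamination is the share of top-k retrieved slots, across all queries, that AI-generated documents occupy. The function names, the `is_synthetic` flags, and the toy rankings are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def pool_contamination(is_synthetic: Sequence[bool]) -> float:
    """Share of documents in the retrieval pool flagged as AI-generated (assumed definition)."""
    return sum(is_synthetic) / len(is_synthetic)


def exposure_contamination(
    rankings: Sequence[Sequence[int]],
    is_synthetic: Sequence[bool],
    k: int = 10,
) -> float:
    """Share of top-k result slots, pooled over all queries, held by AI-generated documents."""
    # Flatten the top-k document ids of every query into one list of exposed slots.
    exposed = [doc_id for ranking in rankings for doc_id in ranking[:k]]
    return sum(is_synthetic[doc_id] for doc_id in exposed) / len(exposed)


# Toy data (hypothetical): 6 documents, ids 2..5 are AI-generated.
flags = [False, False, True, True, True, True]

# Two queries, each with a ranked list of document ids (best first).
rankings = [
    [2, 3, 0, 1],
    [4, 5, 3, 0],
]

pool = pool_contamination(flags)
exposure = exposure_contamination(rankings, flags, k=3)
print(f"pool contamination:     {pool:.2f}")
print(f"exposure contamination: {exposure:.2f}")
```

In this toy setup the ranker places synthetic documents above original ones, so exposure contamination (5 of 6 top-3 slots) exceeds pool contamination (4 of 6 documents). That gap is the kind of amplification the abstract describes for the SEO scenario; the numbers here are illustrative only and are not the paper's results.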

Problem

Research questions and friction points this paper is trying to address.

Retrieval Collapse

AI-generated content

information retrieval

RAG

web pollution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval Collapse

AI-generated content

Retrieval-Augmented Generation