🤖 AI Summary
This work addresses the threat that AI-generated content poses to the diversity and reliability of web information sources, and introduces the concept of “retrieval collapse.” The phenomenon unfolds in two stages: AI-generated content first dominates search results and erodes source diversity, which then opens the way for low-quality or adversarial content to infiltrate the retrieval pipeline. The study runs controlled experiments that inject both high-quality SEO-style content and adversarially crafted content into retrieval pools, and compares a BM25 baseline with LLM-based rankers. In the SEO scenario, contaminating 67% of the pool leads to over 80% of exposed results being contaminated, while answer accuracy remains stable, so the system looks healthy despite its reliance on synthetic sources. Under adversarial contamination, BM25 exposes approximately 19% of harmful content, whereas LLM-based rankers suppress it more strongly. The findings highlight the risk of synthetic content silently displacing authentic sources.
📝 Abstract
The rapid proliferation of AI-generated content on the Web presents a structural risk to information retrieval, as search engines and Retrieval-Augmented Generation (RAG) systems increasingly consume evidence produced by Large Language Models (LLMs). We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage process in which (1) AI-generated content dominates search results, eroding source diversity, and (2) low-quality or adversarial content infiltrates the retrieval pipeline. We analyze this dynamic through controlled experiments involving both high-quality SEO-style content and adversarially crafted content. In the SEO scenario, a 67% pool contamination led to over 80% exposure contamination, creating a homogenized yet deceptively healthy state in which answer accuracy remains stable despite the reliance on synthetic sources. Conversely, under adversarial contamination, baselines such as BM25 exposed about 19% of harmful content, whereas LLM-based rankers demonstrated stronger suppression capabilities. These findings highlight the risk of retrieval pipelines quietly shifting toward synthetic evidence and the need for retrieval-aware strategies to prevent a self-reinforcing cycle of quality decline in Web-grounded systems.
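
The gap between pool contamination and exposure contamination can be illustrated with a minimal Python sketch. The abstract does not define either metric, so the definitions here are assumptions: pool contamination is taken as the share of synthetic documents in the candidate pool, and exposure contamination as the share of top-k result slots filled by synthetic documents. The `Doc` class, both function names, and the toy rankings are hypothetical and are not the paper's data or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    synthetic: bool  # hypothetical label: True if the document is AI-generated


def pool_contamination(pool):
    """Assumed definition: fraction of documents in the pool that are synthetic."""
    return sum(d.synthetic for d in pool) / len(pool)


def exposure_contamination(rankings, k):
    """Assumed definition: fraction of top-k slots, over all queries,
    that are filled by synthetic documents."""
    shown = [d for ranked in rankings for d in ranked[:k]]
    return sum(d.synthetic for d in shown) / len(shown)


# Toy pool invented for illustration: 2 human-written and 4 synthetic documents.
human = [Doc(f"h{i}", False) for i in range(2)]
synth = [Doc(f"s{i}", True) for i in range(4)]
pool = human + synth

# Toy rankings for two queries, where the ranker tends to favor synthetic documents.
rankings = [
    [synth[0], synth[1], synth[2], human[0], synth[3], human[1]],
    [synth[1], human[0], synth[0], synth[3], human[1], synth[2]],
]

pool_rate = pool_contamination(pool)
exposure_rate = exposure_contamination(rankings, k=3)

print(f"pool contamination:     {pool_rate:.1%}")
print(f"exposure contamination: {exposure_rate:.1%}")
```

In this toy setup, exposure exceeds the pool rate because synthetic documents occupy more than their proportional share of the top-k slots, which is the kind of amplification the SEO scenario describes.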