🤖 AI Summary
Existing web content extraction methods suffer from three key limitations at scale: low efficiency (high latency of generative models), poor adaptability (weak generalization of rule-based approaches), and structural neglect (loss of HTML semantics under chunking and re-ranking). To address these, this paper proposes a novel indexing-based paradigm that reformulates content extraction as a structure-aware, discriminative index prediction task, shifting from generation to precise localization. The method introduces an HTML-structure-aware segmentation mechanism and an addressable fragment indexing scheme, enabling lightweight discriminative models to directly predict the positions of query-relevant fragments. This fully decouples extraction latency from webpage length; to the authors' knowledge, it is the first work to achieve such a paradigm shift. Extensive experiments across three tasks (RAG-based QA, main-content extraction, and query-relevant extraction) demonstrate state-of-the-art performance, with higher match rates, lower inference latency, and stronger robustness.
📝 Abstract
As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management under large token budgets and low signal density emerges as a foundational and technically challenging problem for agentic and RAG pipelines. Existing solutions for extracting relevant content are inadequate: generative extraction models suffer from high latency, rule-based heuristics lack adaptability, and chunk-and-rerank methods are blind to webpage structure. To overcome these issues, we introduce Index-based Web Content Extraction, which reframes extraction from slow, token-by-token generation into a highly efficient, discriminative task of index prediction, achieving both effectiveness and efficiency. We partition HTML into structure-aware, addressable segments and extract only the positional indices of content relevant to a given query. This decouples extraction latency from content length, enabling rapid, query-relevant extraction. We first evaluate our method as a post-retrieval processing component within a RAG QA system and find that it improves QA accuracy. We then directly measure its match rate against target content in two scenarios: main content extraction (ME) and query-relevant extraction (QE). Experimental results show that our method outperforms existing approaches in both accuracy and speed, effectively bridging the gap between LLMs and the vast body of web pages.
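To make the indexing paradigm concrete, the sketch below shows one minimal way to split HTML into addressable fragments at block-level tag boundaries and then recover content purely by index. This is an illustration under assumed simplifications, not the paper's implementation: the segmentation heuristic, the set of block tags, and the hard-coded indices (which in the real system would come from a trained discriminative model conditioned on the query) are all placeholders.

```python
import re

# Assumed block-level tags used as fragment boundaries (an illustrative choice).
BLOCK_TAG = r"</?(?:p|div|li|h[1-6]|table|section|article)\b[^>]*>"

def segment_html(html: str) -> list[str]:
    """Split HTML into addressable fragments, one per block-level element."""
    parts = re.split(f"({BLOCK_TAG})", html)
    fragments, buf = [], ""
    for part in parts:
        if part.startswith("</"):
            # Closing tag: keep it attached to the current fragment.
            buf += part
        elif re.fullmatch(BLOCK_TAG, part):
            # Opening tag: flush the current fragment and start a new one.
            if buf.strip():
                fragments.append(buf)
            buf = part
        else:
            buf += part
    if buf.strip():
        fragments.append(buf)
    return fragments

def extract_by_indices(fragments: list[str], indices: list[int]) -> str:
    """Join only the fragments a discriminative model selected by index."""
    return "".join(fragments[i] for i in indices)

html = "<div>nav menu</div><p>Relevant answer text.</p><div>footer</div>"
frags = segment_html(html)
# A trained model would predict these indices from (query, fragments);
# here the index list [1] is hard-coded for illustration.
extracted = extract_by_indices(frags, [1])  # "<p>Relevant answer text.</p>"
```

Because the model emits a short list of integers rather than regenerating the content token by token, the cost of the extraction step itself stays constant as the page grows, which is the latency decoupling the abstract refers to.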