🤖 AI Summary
The current internet architecture is document-centric and optimized for human browsing, which makes it ill-suited to AI-driven, fine-grained semantic retrieval and leads to wasted bandwidth, degraded information quality, and added development complexity. This paper proposes the “AI-Native Internet” paradigm, in which servers natively expose semantically structured information chunks instead of monolithic HTML documents. We design a web-native semantic resolution protocol and a lightweight resolver based on a large language model that locates the target semantic units before retrieval. Using a comparative evaluation against HTML-based retrieval, we quantify the inefficiency of conventional webpage parsing for semantic retrieval tasks and identify key technical directions: semantic chunking, protocol extensibility, and keeping the resolver lightweight. Our work establishes both a theoretical foundation and a practical framework for building a next-generation internet infrastructure that is semantics-driven, efficient, and trustworthy.
📝 Abstract
The rise of generative AI search is fundamentally transforming how users and intelligent systems interact with the Internet. LLMs increasingly act as intermediaries between humans and web information. Yet the web remains optimized for human browsing rather than AI-driven semantic retrieval, resulting in wasted network bandwidth, lower information quality, and unnecessary complexity for developers. We introduce the concept of an AI-Native Internet, a web architecture in which servers expose semantically relevant information chunks rather than full documents, supported by a Web-native semantic resolver that allows AI applications to discover relevant information sources before retrieving fine-grained chunks. Through motivating experiments, we quantify the inefficiencies of current HTML-based retrieval and outline architectural directions and open challenges for evolving today's document-centric web into an AI-oriented substrate that better supports semantic access to web content.
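The two-step flow described in the abstract (resolve relevant sources first, then fetch only fine-grained chunks) can be illustrated with a small in-memory sketch. This is not the paper's protocol or implementation: the names `ChunkServer`, `SemanticResolver`, and `retrieve`, the `.example` host names, and the sample chunk texts are all hypothetical, and plain keyword overlap stands in for whatever semantic matching a real resolver would use.

```python
from dataclasses import dataclass


def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenization; a toy stand-in for semantic matching."""
    return set(text.lower().split())


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str


class ChunkServer:
    """Hypothetical origin server that exposes chunks instead of whole documents."""

    def __init__(self, name: str, chunks: list[Chunk]) -> None:
        self.name = name
        self.chunks = chunks

    def vocabulary(self) -> set[str]:
        # Terms this server advertises to the resolver (assumed mechanism).
        vocab: set[str] = set()
        for chunk in self.chunks:
            vocab |= tokenize(chunk.text)
        return vocab

    def query(self, terms: set[str], top_k: int = 1) -> list[Chunk]:
        # Score each chunk by term overlap and return only the best matches.
        scored = [(len(terms & tokenize(c.text)), c) for c in self.chunks]
        matches = [(score, c) for score, c in scored if score > 0]
        matches.sort(key=lambda pair: (-pair[0], pair[1].chunk_id))
        return [c for _, c in matches[:top_k]]


class SemanticResolver:
    """Hypothetical resolver: maps a query to servers likely to hold relevant chunks."""

    def __init__(self) -> None:
        self.servers: list[ChunkServer] = []

    def register(self, server: ChunkServer) -> None:
        self.servers.append(server)

    def resolve(self, terms: set[str]) -> list[ChunkServer]:
        return [s for s in self.servers if terms & s.vocabulary()]


def retrieve(resolver: SemanticResolver, question: str, top_k: int = 1) -> list[tuple[str, Chunk]]:
    """Step 1: discover sources via the resolver. Step 2: fetch only matching chunks."""
    terms = tokenize(question)
    results: list[tuple[str, Chunk]] = []
    for server in resolver.resolve(terms):
        for chunk in server.query(terms, top_k):
            results.append((server.name, chunk))
    return results


# Illustrative data only; hosts and texts are made up for this sketch.
resolver = SemanticResolver()
resolver.register(ChunkServer("weather.example", [
    Chunk("w1", "rain expected in paris tomorrow"),
    Chunk("w2", "sunny skies over madrid today"),
]))
resolver.register(ChunkServer("recipes.example", [
    Chunk("r1", "bake bread at high heat"),
]))

hits = retrieve(resolver, "paris rain")
for server_name, chunk in hits:
    print(server_name, chunk.chunk_id, chunk.text)
```

In this toy setup the client never downloads a full page: the resolver narrows the query to one server, and that server returns a single chunk. A real deployment would need a wire protocol, ranking by actual semantic similarity, and trust mechanisms, none of which this sketch attempts.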