🤖 AI Summary
This work addresses the limitations of existing vector-based retrieval methods for unstructured documents: coarse semantic matching, high computational overhead, and heavy reliance on frequent large language model (LLM) invocations. The authors propose AnnoRetrieve, a retrieval paradigm grounded in structured annotation, with two components. SchemaBoot automatically induces multi-granularity document annotation schemas. The Structured Semantic Retrieval (SSR) engine, built on those schemas, combines semantic understanding with SQL-like structured querying, removing the dependence on vector embeddings and LLM-based post-processing. Experiments on three real-world datasets show that the method substantially reduces LLM invocation frequency and retrieval cost while maintaining high accuracy, which supports its claimed advantages in precision, efficiency, and scalability.
📝 Abstract
Unstructured documents dominate enterprise and web data, but their lack of explicit organization hinders precise information retrieval. Current mainstream retrieval methods, especially embedding-based vector search, rely on coarse-grained semantic similarity, which incurs high computational cost and requires frequent LLM calls for post-processing. To address this, we propose AnnoRetrieve, a novel retrieval paradigm that shifts from embeddings to structured annotations, enabling precise, annotation-driven semantic retrieval. Our system replaces expensive vector comparisons with lightweight structured queries over automatically induced schemas, substantially reducing LLM usage and overall cost. The system integrates two core components. SchemaBoot automatically generates document annotation schemas via multi-granularity pattern discovery and constraint-based optimization, which eliminates manual schema design and lays the foundation for annotation-driven retrieval. Structured Semantic Retrieval (SSR), the core retrieval engine, unifies semantic understanding with structured query execution: by leveraging the annotated structure instead of vector embeddings, SSR achieves precise semantic matching and completes attribute-value extraction, table generation, and progressive SQL-based reasoning without relying on LLM intervention. This annotation-driven paradigm addresses both the coarse-grained matching and heavy LLM dependency of traditional vector-based methods and the high computational overhead of graph-based methods. Experiments on three real-world datasets confirm that AnnoRetrieve significantly lowers LLM call frequency and retrieval cost while maintaining high accuracy. AnnoRetrieve establishes a new paradigm for cost-effective, precise, and scalable document analysis through intelligent structuring.
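To make the idea of querying annotations instead of embeddings concrete, here is a minimal sketch. It is not the authors' implementation: the attribute names (`vendor`, `contract_value`, `year`), the `contracts` table, and the helper functions are all hypothetical, and the annotations are hand-written stand-ins for what a schema-guided annotation step might produce. The sketch only shows attribute-value pairs being pivoted into a table and then filtered with plain SQL, with no vector comparison or LLM call at query time.

```python
import sqlite3

# Hypothetical (doc_id, attribute, value) annotations. In the paper's setting
# these would come from annotating documents under an induced schema; here
# they are invented purely for illustration.
ANNOTATIONS = [
    ("d1", "vendor", "Acme"),
    ("d1", "contract_value", "120000"),
    ("d1", "year", "2023"),
    ("d2", "vendor", "Globex"),
    ("d2", "contract_value", "45000"),
    ("d2", "year", "2023"),
    ("d3", "vendor", "Acme"),
    ("d3", "contract_value", "80000"),
    ("d3", "year", "2022"),
]


def build_table(conn, annotations):
    """Pivot attribute-value annotations into one relational row per document."""
    conn.execute(
        "CREATE TABLE contracts ("
        "doc_id TEXT PRIMARY KEY, vendor TEXT, "
        "contract_value INTEGER, year INTEGER)"
    )
    rows = {}
    for doc_id, attr, value in annotations:
        rows.setdefault(doc_id, {})[attr] = value
    # This sketch assumes every document carries all three attributes.
    for doc_id, attrs in rows.items():
        conn.execute(
            "INSERT INTO contracts VALUES (?, ?, ?, ?)",
            (doc_id, attrs["vendor"],
             int(attrs["contract_value"]), int(attrs["year"])),
        )


def query_high_value(conn, year, min_value):
    """Answer a precise question with a structured filter, not a similarity search."""
    cur = conn.execute(
        "SELECT doc_id, vendor FROM contracts "
        "WHERE year = ? AND contract_value >= ? ORDER BY doc_id",
        (year, min_value),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
build_table(conn, ANNOTATIONS)
result = query_high_value(conn, 2023, 100000)
print(result)  # [('d1', 'Acme')]
```

A query such as "contracts from 2023 worth at least 100,000" is answered exactly by the `WHERE` clause, whereas a similarity search would return nearby passages that still need filtering. How AnnoRetrieve actually induces schemas and plans its progressive SQL reasoning is not specified in the abstract, so those parts are deliberately left out.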