🤖 AI Summary
To address the prohibitively high computational cost of LLM-based semantic predicate analysis over large-scale document collections, this paper proposes a two-stage efficient framework. In the offline stage, a semantic index is constructed: an LLM encodes each document into a semantic representation. In the online stage, for each query, a lightweight proxy model is trained on these representations via contrastive learning to approximate the LLM's decisions, and an adaptive cascaded filtering mechanism dynamically prunes candidate documents, drastically reducing the number of samples that require LLM-based re-ranking. The core innovation lies in the co-design of the proxy model and the cascaded filtering strategy, achieving Pareto-optimal trade-offs between accuracy and efficiency. Evaluated on three benchmark datasets, the framework improves end-to-end processing speed by over 2× and reduces LLM invocations by up to 85%, significantly enhancing scalability and practicality for large-scale semantic querying.
📝 Abstract
Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. While Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, their high inference cost makes applying them to enormous document collections and ad-hoc queries prohibitively expensive. We therefore introduce ScaleDoc, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, ScaleDoc leverages an LLM to generate a semantic representation for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter out the majority of documents, forwarding only the ambiguous cases to the LLM for a final decision. Furthermore, ScaleDoc proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicate decision scores; and (2) an adaptive cascade mechanism that determines an effective filtering policy while meeting a specified accuracy target. Our evaluations across three datasets demonstrate that ScaleDoc achieves over a 2× end-to-end speedup and reduces expensive LLM invocations by up to 85%, making large-scale semantic analysis practical and efficient.
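The cascade idea described above can be sketched as follows. This is a minimal illustration, not ScaleDoc's actual implementation: the function names (`proxy_score`, `llm_judge`) and the fixed thresholds are assumptions for exposition; ScaleDoc's adaptive mechanism would instead tune the thresholds to meet an accuracy target.

```python
# Hypothetical sketch of cascade-style predicate filtering.
# A cheap proxy scores every document; scores outside an "ambiguous band"
# decide immediately, and only the band in between is sent to the LLM.

def cascade_filter(docs, proxy_score, llm_judge, t_low=0.2, t_high=0.8):
    """Return (accepted_docs, number_of_LLM_calls)."""
    accepted, llm_calls = [], 0
    for doc in docs:
        s = proxy_score(doc)
        if s >= t_high:            # proxy confidently accepts: no LLM call
            accepted.append(doc)
        elif s <= t_low:           # proxy confidently rejects: no LLM call
            continue
        else:                      # ambiguous: defer to the expensive LLM
            llm_calls += 1
            if llm_judge(doc):
                accepted.append(doc)
    return accepted, llm_calls
```

With well-calibrated proxy scores, most documents fall outside the ambiguous band, so the LLM is invoked only for the small fraction of borderline cases; widening the band trades more LLM calls for higher fidelity to the LLM's decisions.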