🤖 AI Summary
This work addresses the problem that semantic aggregation may produce claims unsupported by the underlying data. Verifying such claims is difficult because the relations involved far exceed LLM context windows and the claims require a costly combination of semantic and symbolic processing. The authors present Evergreen, a system that reformulates claim verification as a semantic query processing task: each natural language claim is compiled into a declarative verification query and executed on the same engine that produced the aggregate. The system combines verification-aware optimizations (early stopping, relevance sorting, and estimation with confidence sequences) with general-purpose semantic query optimizations (operator fusion, similarity filtering, and prompt caching). Each verdict comes with citations identifying a minimal set of tuples that justify it, with semantics based on semiring provenance for first-order logic. On a benchmark of real-world restaurant review datasets, Evergreen achieves F1 = 1.00 with a strong LLM while reducing cost by 3.2× and latency by 4.0× compared to unoptimized verification. With a significantly weaker LLM, it outperforms a strong LLM-as-a-judge baseline in F1 at 48× lower cost and 2.3× lower latency; relative to a retrieval-augmented agent, it achieves the same F1 at 63× lower cost and 4.2× lower latency.
📝 Abstract
With recent semantic query processing engines, semantic aggregation has become a primitive operator, enabling the reduction of a relation into a natural language aggregate using an LLM. However, the resulting semantic aggregate may contain claims that are not grounded in the underlying relation. Verifying such claims is challenging: they often involve quantifiers, groupings, and comparisons over relations that far exceed LLM context windows and require a costly combination of semantic and symbolic processing.
We present Evergreen, a system that recasts claim verification as a semantic query processing task with tailored optimizations and provenance capture. Evergreen compiles each claim into a declarative semantic verification query and executes it on the same engine that produced the aggregate. To reduce cost and latency, Evergreen avoids unnecessary LLM calls through verification-aware optimizations (early stopping, relevance sorting, and estimation with confidence sequences) and general-purpose optimizations for semantic queries (operator fusion, similarity filtering, and prompt caching). Each verdict is accompanied by citations that identify a minimal set of tuples justifying the result, with semantics based on semiring provenance for first-order logic.
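To make the interplay of early stopping, relevance sorting, and minimal citations concrete, the following is a minimal sketch and not Evergreen's implementation. It verifies a claim of the form "at least k tuples satisfy a semantic predicate". The names `verify_at_least`, `predicate`, and `relevance` are hypothetical; the keyword-based predicate stands in for an LLM call, and the keyword-overlap score stands in for a cheap similarity measure.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def verify_at_least(
    rows: List[Row],
    predicate: Callable[[Row], bool],   # assumption: stands in for an expensive LLM call
    relevance: Callable[[Row], float],  # assumption: cheap score, e.g. embedding similarity
    k: int,
) -> Tuple[bool, List[Row], int]:
    """Check the claim 'at least k rows satisfy predicate'.

    Returns (verdict, evidence, predicate_calls).
    """
    # Relevance sorting: examine the most promising rows first so that
    # witnesses tend to be found after few predicate calls.
    ordered = sorted(rows, key=relevance, reverse=True)

    evidence: List[Row] = []
    calls = 0
    for row in ordered:
        calls += 1
        if predicate(row):
            evidence.append(row)
            # Early stopping: k witnesses already prove the claim, and
            # exactly k of them form a minimal sufficient evidence set.
            if len(evidence) == k:
                return True, evidence, calls

    # A negative verdict requires scanning every row; the witnesses found
    # so far are returned, but the refutation rests on the full scan.
    return False, evidence, calls


reviews = [
    {"id": 1, "text": "Great pasta, slow service"},
    {"id": 2, "text": "The waiter was rude"},
    {"id": 3, "text": "Lovely patio"},
    {"id": 4, "text": "Rude staff and cold food"},
]


def mentions_rudeness(row: Row) -> bool:
    return "rude" in str(row["text"]).lower()


def keyword_overlap(row: Row) -> float:
    text = str(row["text"]).lower()
    return float(sum(word in text for word in ("rude", "staff", "waiter")))


verdict, evidence, calls = verify_at_least(reviews, mentions_rudeness, keyword_overlap, k=2)
print(verdict, [r["id"] for r in evidence], calls)
```

In this toy run the claim "at least two reviews mention rude staff" is confirmed after two predicate calls instead of four, and the two cited reviews are the evidence. Claims that need proportions or counts over large relations would call for estimation rather than an exact scan, which this sketch does not cover.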
On a benchmark of real-world restaurant review datasets reflecting production-inspired workloads, Evergreen achieves excellent verification quality (F1 = 1.00) with a strong LLM while reducing cost by 3.2x and latency by 4.0x compared to unoptimized verification. Even with a significantly weaker LLM, Evergreen outperforms a strong LLM-as-a-judge baseline in F1 at 48x lower cost and 2.3x lower latency. Relative to a retrieval-augmented agent, Evergreen compares favorably in F1 and latency with similar cost when both use a strong LLM; yet, with a much weaker LLM, it achieves the same F1 at 63x lower cost and 4.2x lower latency.