🤖 AI Summary
Existing databases struggle to efficiently support hybrid queries over structured and unstructured (e.g., vector) data, resulting in poor performance for joint semantic retrieval and SQL execution. This paper introduces CHASE, a query engine natively designed for such hybrid queries. The approach rests on three core innovations: (1) semantic-aware query classification and dynamic physical plan optimization; (2) customized physical operators that eliminate redundant computation; and (3) a JIT-compilation-based execution framework tailored for vector–relational hybrid workloads, integrating approximate nearest-neighbor indexing with a unified hybrid query optimizer. Evaluated on real-world datasets, the engine achieves end-to-end query speedups ranging from 13% to 7500× over state-of-the-art systems. These gains improve efficiency in multimodal recommendation and analytical scenarios that require tight coupling of semantic and relational operations.
📝 Abstract
Querying both structured and unstructured data has become a new paradigm in data analytics and recommendation. Unstructured data, such as text and videos, are converted to high-dimensional vectors and queried with approximate nearest neighbor search (ANNS). State-of-the-art database systems implement vector search as a plugin to the relational query engine, which tries to use the ANN index to improve performance. After investigating a broad range of hybrid queries, we find that such designs can miss optimization opportunities and achieve suboptimal performance for certain queries. In this paper, we propose CHASE, a query engine natively designed to support efficient hybrid queries on structured and unstructured data. CHASE introduces targeted designs and optimizations at multiple stages of query processing. First, semantic analysis is performed to categorize queries and optimize query plans dynamically. Second, new physical operators are implemented to avoid the redundant computations incurred by existing operators. Third, compilation-based techniques are adopted for efficient machine code generation. Extensive evaluations on real-world datasets demonstrate that CHASE achieves substantial performance improvements, with speedups ranging from 13% to 7500× over existing systems. These results highlight CHASE's potential as a robust solution for executing hybrid queries.
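To make the notion of a hybrid query concrete, the following minimal sketch combines a structured predicate with a top-k vector search over a toy in-memory table. It is not CHASE's implementation: the table, the column names (`price`, `vec`), and both plan functions are hypothetical, and exact brute-force search stands in for an ANN index. It only illustrates why the choice of plan matters: filtering after the vector search can return fewer than k rows, while filtering first cannot use a prebuilt index over the whole table.

```python
import math

# Hypothetical toy table: one structured column (price) and one embedding (vec).
ROWS = [
    {"id": 1, "price": 10, "vec": (1.0, 0.0)},
    {"id": 2, "price": 50, "vec": (0.9, 0.1)},
    {"id": 3, "price": 12, "vec": (0.0, 1.0)},
    {"id": 4, "price": 70, "vec": (0.8, 0.2)},
    {"id": 5, "price": 15, "vec": (0.6, 0.4)},
]


def top_k(rows, query, k):
    """Exact k-nearest rows by Euclidean distance (stand-in for an ANN index)."""
    return sorted(rows, key=lambda r: math.dist(r["vec"], query))[:k]


def filter_then_search(rows, query, k, pred):
    """Plan A: apply the structured predicate first, then rank the survivors."""
    return [r["id"] for r in top_k([r for r in rows if pred(r)], query, k)]


def search_then_filter(rows, query, k, pred):
    """Plan B: take the global top-k first, then drop rows failing the predicate."""
    return [r["id"] for r in top_k(rows, query, k) if pred(r)]


# Roughly: SELECT id FROM items WHERE price < 20 ORDER BY distance(vec, q) LIMIT 2
query = (1.0, 0.0)
cheap = lambda r: r["price"] < 20

pre = filter_then_search(ROWS, query, 2, cheap)
post = search_then_filter(ROWS, query, 2, cheap)
print(pre)   # [1, 5]
print(post)  # [1]
```

Plan B returns only one row because the second-nearest vector overall fails the price predicate. Plugin-style designs typically work around this by over-fetching candidates from the index and re-checking them, which is one kind of redundant work that a natively hybrid engine has room to optimize.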