🤖 AI Summary
Existing information retrieval (IR) benchmarks primarily focus on semantic matching for single- or multi-hop queries, failing to adequately evaluate models’ ability to handle complex logical queries involving first-order logic operations—such as conjunction, disjunction, and negation.
Method: We introduce ComLQ, the first benchmark tailored to complex logical querying, comprising 2,909 structured logical queries and 11,251 candidate documents. We propose a subgraph-guided, LLM-based data construction method to ensure logical structural alignment between queries and documents, and design LSNC@K, a new metric that quantifies retrieval consistency under negation. Data quality is ensured through subgraph-informed prompting of GPT-4o and expert verification.
Contribution/Results: Zero-shot evaluations reveal that state-of-the-art retrieval models exhibit significant performance degradation on negation-heavy queries, confirming ComLQ’s rigor and its value in exposing critical limitations in current IR systems.
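To make "complex logical query" concrete: a query such as "passages about reinforcement learning and robotics, but not simulation" corresponds to the first-order formula A ∧ B ∧ ¬C over passage contents. A toy sketch of checking whether a passage satisfies such a formula (the topic-set representation here is purely illustrative, not ComLQ's actual query or document format):

```python
def satisfies(topics, must_have, must_not_have):
    """Toy check of a conjunction-plus-negation formula.

    `topics` models a passage as the set of topics it covers (an
    illustrative simplification). The passage satisfies the query
    A ∧ B ∧ ¬C iff it covers every required topic and none of the
    negated ones.
    """
    return must_have <= topics and not (must_not_have & topics)

passage = {"reinforcement learning", "robotics"}
satisfies(passage, {"reinforcement learning", "robotics"}, {"simulation"})  # True
satisfies(passage | {"simulation"}, {"robotics"}, {"simulation"})           # False
```

The point of the benchmark is that real retrievers must recover this behavior from unstructured text, where the negation condition is easy to miss.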
📝 Abstract
Information retrieval (IR) systems play a critical role in navigating information overload across various applications. Existing IR benchmarks primarily focus on simple queries that are semantically analogous to single- and multi-hop relations, overlooking *complex logical queries* involving first-order logic operations such as conjunction (∧), disjunction (∨), and negation (¬). Thus, these benchmarks cannot sufficiently evaluate the performance of IR models on complex queries in real-world scenarios. To address this problem, we propose a novel method leveraging large language models (LLMs) to construct a new IR dataset, **ComLQ**, for **Com**plex **L**ogical **Q**ueries, which comprises 2,909 queries and 11,251 candidate passages. A key challenge in constructing the dataset lies in capturing the underlying logical structures within unstructured text. Therefore, by designing a subgraph-guided prompt with a subgraph indicator, an LLM (such as GPT-4o) is guided to generate queries with specific logical structures based on selected passages. All query-passage pairs in ComLQ are verified for *structure conformity* and *evidence distribution* through expert annotation. To better evaluate whether retrievers can handle queries with negation, we further propose a new evaluation metric, **Log-Scaled Negation Consistency** (**LSNC@K**). As a supplement to standard relevance-based metrics (such as nDCG and mAP), LSNC@K measures whether the top-K retrieved passages violate negation conditions in queries. Our experimental results under zero-shot settings demonstrate existing retrieval models' limited performance on complex logical queries, especially queries with negation, exposing their weak ability to model exclusion.
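The abstract describes LSNC@K only informally. One plausible log-scaled formulation, sketched below as an assumption rather than the paper's exact formula, penalizes each top-K passage that violates a negation condition with the same 1/log2(rank+1) discount used by DCG, normalized so that 1.0 means no violations and 0.0 means every top-K slot violates:

```python
import math

def lsnc_at_k(violates, k):
    """Hypothetical sketch of a log-scaled negation-consistency score.

    `violates` is a ranked list of booleans: True if the passage at that
    rank contradicts a negation condition in the query. The paper's
    exact definition may differ; here, violations near the top of the
    ranking (where the log discount is largest) hurt the score most.
    """
    top = violates[:k]
    # Discount for rank i (1-indexed) is 1/log2(i + 1).
    discounts = [1.0 / math.log2(i + 2) for i in range(len(top))]
    penalty = sum(d for d, v in zip(discounts, top) if v)
    return 1.0 - penalty / sum(discounts)
```

Under this sketch, a ranking whose top-1 passage violates a negation is scored lower than one whose only violation sits at rank K, matching the intuition that users mostly see the head of the ranking.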