🤖 AI Summary
Existing information retrieval (IR) benchmarks primarily focus on semantic matching for single- or multi-hop queries, failing to adequately evaluate models’ ability to handle complex logical queries involving first-order logic operations—such as conjunction, disjunction, and negation.
Method: We introduce ComLQ, the first benchmark tailored to complex logical querying, comprising 2,909 structured logical queries and 11,251 candidate documents. We propose a subgraph-guided, LLM-based data construction method to ensure logical structural alignment between queries and documents, and design LSNC@K, a new metric that quantifies retrieval consistency under negation. Data quality is ensured through subgraph-informed prompting of GPT-4o and expert verification.
Contribution/Results: Zero-shot evaluations reveal that state-of-the-art retrieval models exhibit significant performance degradation on negation-heavy queries, confirming ComLQ’s rigor and its value in exposing critical limitations in current IR systems.
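To make "complex logical query" concrete: a query such as "passages about reinforcement learning and robotics, but not simulation" corresponds to the first-order formula A ∧ B ∧ ¬C over passage contents. A toy sketch of checking whether a passage satisfies such a formula (the topic-set representation here is purely illustrative, not ComLQ's actual query or document format):

```python
def satisfies(topics, must_have, must_not_have):
    """Toy check of a conjunction-plus-negation formula.

    `topics` models a passage as the set of topics it covers (an
    illustrative simplification). The passage satisfies the query
    A ∧ B ∧ ¬C iff it covers every required topic and none of the
    negated ones.
    """
    return must_have <= topics and not (must_not_have & topics)

passage = {"reinforcement learning", "robotics"}
satisfies(passage, {"reinforcement learning", "robotics"}, {"simulation"})  # True
satisfies(passage | {"simulation"}, {"robotics"}, {"simulation"})           # False
```

The point of the benchmark is that real retrievers must recover this behavior from unstructured text, where the negation condition is easy to miss.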
📝 Abstract
Information retrieval (IR) systems play a critical role in navigating information overload across various applications. Existing IR benchmarks primarily focus on simple queries that are semantically analogous to single- and multi-hop relations, overlooking *complex logical queries* involving first-order logic operations such as conjunction (∧), disjunction (∨), and negation (¬). Thus, these benchmarks cannot sufficiently evaluate the performance of IR models on complex queries in real-world scenarios. To address this problem, we propose a novel method leveraging large language models (LLMs) to construct a new IR dataset, **ComLQ**, for **Com**plex **L**ogical **Q**ueries, which comprises 2,909 queries and 11,251 candidate passages. A key challenge in constructing the dataset lies in capturing the underlying logical structures within unstructured text. Therefore, by designing a subgraph-guided prompt with a subgraph indicator, an LLM (such as GPT-4o) is guided to generate queries with specific logical structures based on selected passages. All query-passage pairs in ComLQ are verified for *structure conformity* and *evidence distribution* through expert annotation. To better evaluate whether retrievers can handle queries with negation, we further propose a new evaluation metric, **Log-Scaled Negation Consistency** (**LSNC@K**). As a supplement to standard relevance-based metrics (such as nDCG and mAP), LSNC@K measures whether the top-K retrieved passages violate negation conditions in queries. Our experimental results under zero-shot settings demonstrate existing retrieval models' limited performance on complex logical queries, especially queries with negation, exposing their weak ability to model exclusion.
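The abstract describes LSNC@K only informally. One plausible log-scaled formulation, sketched below as an assumption rather than the paper's exact formula, penalizes each top-K passage that violates a negation condition with the same 1/log2(rank+1) discount used by DCG, normalized so that 1.0 means no violations and 0.0 means every top-K slot violates:

```python
import math

def lsnc_at_k(violates, k):
    """Hypothetical sketch of a log-scaled negation-consistency score.

    `violates` is a ranked list of booleans: True if the passage at that
    rank contradicts a negation condition in the query. The paper's
    exact definition may differ; here, violations near the top of the
    ranking (where the log discount is largest) hurt the score most.
    """
    top = violates[:k]
    # Discount for rank i (1-indexed) is 1/log2(i + 1).
    discounts = [1.0 / math.log2(i + 2) for i in range(len(top))]
    penalty = sum(d for d, v in zip(discounts, top) if v)
    return 1.0 - penalty / sum(discounts)
```

Under this sketch, a ranking whose top-1 passage violates a negation is scored lower than one whose only violation sits at rank K, matching the intuition that users mostly see the head of the ranking.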