ComLQ: Benchmarking Complex Logical Queries in Information Retrieval

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing information retrieval (IR) benchmarks primarily focus on semantic matching for single- or multi-hop queries, failing to adequately evaluate models’ ability to handle complex logical queries involving first-order logic operations—such as conjunction, disjunction, and negation. Method: We introduce ComLQ, the first benchmark tailored for complex logical querying, comprising 2,909 structured logical queries and 11,251 candidate documents. We propose a subgraph-guided LLM-based data construction method to ensure logical structural alignment between queries and documents, and design LSNC@K—a novel metric quantifying retrieval consistency under negation. Data quality is ensured via GPT-4o generation, expert verification, and subgraph-informed prompting. Contribution/Results: Zero-shot evaluations reveal that state-of-the-art retrieval models exhibit significant performance degradation on negation-heavy queries, confirming ComLQ’s rigor and its value in exposing critical limitations in current IR systems.

📝 Abstract
Information retrieval (IR) systems play a critical role in navigating information overload across various applications. Existing IR benchmarks primarily focus on simple queries that are semantically analogous to single- and multi-hop relations, overlooking complex logical queries involving first-order logic operations such as conjunction (∧), disjunction (∨), and negation (¬). Thus, these benchmarks cannot sufficiently evaluate the performance of IR models on complex queries in real-world scenarios. To address this problem, we propose a novel method leveraging large language models (LLMs) to construct a new IR dataset, ComLQ, for Complex Logical Queries, which comprises 2,909 queries and 11,251 candidate passages. A key challenge in constructing the dataset lies in capturing the underlying logical structures within unstructured text. Therefore, by designing a subgraph-guided prompt with a subgraph indicator, an LLM (such as GPT-4o) is guided to generate queries with specific logical structures based on selected passages. All query-passage pairs in ComLQ are verified for structure conformity and evidence distribution through expert annotation. To better evaluate whether retrievers can handle queries with negation, we further propose a new evaluation metric, Log-Scaled Negation Consistency (LSNC@K). As a supplement to standard relevance-based metrics (such as nDCG and mAP), LSNC@K measures whether the top-K retrieved passages violate negation conditions in queries. Our experimental results under zero-shot settings demonstrate existing retrieval models' limited performance on complex logical queries, especially on queries with negation, exposing their weak ability to model exclusion.
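The abstract does not spell out the LSNC@K formula, so the following is only a hedged sketch of what a log-scaled negation-consistency score could look like: it assumes a DCG-style log discount, where each top-K passage that violates the query's negation condition is penalized by 1/log2(rank+1), normalized against the worst case in which all K passages violate. The function name, the discount choice, and the `violates_negation` predicate are illustrative assumptions, not the paper's definition.

```python
import math

def lsnc_at_k(retrieved, violates_negation, k=10):
    """Hypothetical sketch of a log-scaled negation-consistency score.

    retrieved:          ranked list of document ids (best first)
    violates_negation:  predicate returning True if a document violates
                        the query's negation condition (an assumption;
                        the paper's exact formulation may differ)
    Returns a value in [0, 1]: 1.0 means no top-K passage violates the
    negation condition, 0.0 means every top-K passage violates it.
    """
    top_k = retrieved[:k]
    # Penalize violations more heavily at higher ranks (DCG-style discount).
    penalty = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(top_k, start=1)
        if violates_negation(doc)
    )
    # Normalize by the worst case: all K passages violate the condition.
    worst = sum(1.0 / math.log2(r + 1) for r in range(1, k + 1))
    return 1.0 - penalty / worst
```

Under this sketch, a violating passage at rank 1 costs more than one at rank 10, matching the intuition that negation failures near the top of the ranking are most harmful.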
Problem

Research questions and friction points this paper is trying to address.

Addressing the lack of benchmarks for complex logical queries in information retrieval systems
Developing a dataset using LLMs to handle conjunction, disjunction, and negation operations
Proposing new metrics to evaluate retrieval performance on queries with negation
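For concreteness, a query such as "passages about A and B, but not C" can be represented as a nested first-order expression over atomic conditions. The sketch below uses a hypothetical tuple encoding and a set-of-attributes document model purely for illustration; it is not ComLQ's actual query or passage format.

```python
def matches(doc_attrs, query):
    """Evaluate a nested logical query against a document's attribute set.

    query is either an atomic condition (a string) or a tuple whose first
    element is one of "and", "or", "not" (a hypothetical encoding).
    """
    if isinstance(query, str):          # atomic condition
        return query in doc_attrs
    op, *args = query
    if op == "and":                     # conjunction (∧)
        return all(matches(doc_attrs, a) for a in args)
    if op == "or":                      # disjunction (∨)
        return any(matches(doc_attrs, a) for a in args)
    if op == "not":                     # negation (¬)
        return not matches(doc_attrs, args[0])
    raise ValueError(f"unknown operator: {op}")

# "dense retrieval AND NOT keyword matching"
query = ("and", "dense retrieval", ("not", "keyword matching"))
```

A retriever that handles negation correctly should rank documents satisfying the whole expression above documents that merely match the positive conjuncts.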
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs to generate complex logical queries
Designing subgraph-guided prompts to ensure query structure conformity
Introducing the log-scaled negation consistency (LSNC@K) evaluation metric
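The summary describes a subgraph-guided prompt with a subgraph indicator that steers GPT-4o toward a target logical structure. The paper's actual template is not reproduced here; the sketch below only illustrates the idea of serializing a subgraph and a structure indicator into a generation prompt, with all names (`build_prompt`, the indicator string, the triple format) being assumptions.

```python
def build_prompt(structure, edges):
    """Serialize a subgraph into a query-generation prompt (illustrative).

    structure: a hypothetical structure-indicator name for the target
               logical form of the query
    edges:     (head, relation, tail) triples drawn from selected passages
    """
    triples = "\n".join(f"({h}, {r}, {t})" for h, r, t in edges)
    return (
        f"Generate one retrieval query whose logical structure is '{structure}'.\n"
        f"Ground the query in the following subgraph:\n{triples}\n"
        "Combine the relations with conjunction, disjunction, or negation "
        "exactly as the structure specifies."
    )
```

Fixing the structure indicator per prompt is what lets the dataset builder control the distribution of logical forms (e.g., how many queries contain negation) rather than leaving it to the LLM.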
Ganlin Xu
School of Data Science, Fudan University, Shanghai, China
Zhitao Yin
School of Data Science, Fudan University, Shanghai, China
Linghao Zhang
School of Data Science, Fudan University, Shanghai, China
Jiaqing Liang
Fudan University
Knowledge Graph, Deep Learning
Weijia Lu
Senior Research Scientist, AI Lab, Tencent
Artificial Intelligence, Signal Processing, FEM, Electrophysiology, Ultrasonics
Xiaodong Zhang
United Automotive Electronic Systems, Shanghai, China
Zhifei Yang
Peking University
3D Generation, Generative Models
Sihang Jiang
Fudan University
Knowledge Graph, Large Language Models
Deqing Yang
School of Data Science, Fudan University