🤖 AI Summary
This work addresses the difficulty that existing retrieval systems have with "oblique queries": queries that seek documents instantiating a latent pattern, such as tweets that express an implicit stance, chat logs that demonstrate a particular failure mode, or transcripts that match an abstract scenario. The study identifies this class of queries, examines three mechanisms through which obliqueness may arise, and introduces OBLIQ-Bench, a suite of five oblique search problems over real long-tail corpora. The benchmark exposes an asymmetry between retrieval and verification: reasoning LLMs reliably recognize latent relevance once relevant documents are surfaced, but even sophisticated retrieval pipelines fail to surface most relevant documents in the first place. The bottleneck therefore lies in recalling documents from implicit signals, not in judging them.
📝 Abstract
Retrieval benchmarks are increasingly saturating, but we argue that efficient search is far from a solved problem. We identify a class of queries we call oblique, which seek documents that instantiate a latent pattern, like finding all tweets that express an implicit stance, chat logs that demonstrate a particular failure mode, or transcripts that match an abstract scenario. We study three mechanisms through which obliqueness may arise and introduce OBLIQ-Bench, a suite of five oblique search problems over real long-tail corpora. OBLIQ-Bench exposes an overlooked asymmetry between retrieval and verification, where reasoning LLMs reliably recognize latent relevance whenever relevant documents are surfaced, but even sophisticated retrieval pipelines fail to surface most relevant documents in the first place. We hope that OBLIQ-Bench will drive research into retrieval architectures that efficiently capture latent patterns and implicit signals in large corpora.
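One way to make the retrieval-verification asymmetry concrete is to compare how many relevant documents a retriever surfaces in its top-k results against how many a verifier accepts when shown the relevant documents directly. The sketch below is only an illustration of that comparison, not OBLIQ-Bench's evaluation protocol: the function names, the toy document IDs, and the `judge` stand-in for a reasoning LLM are all assumptions.

```python
from typing import Callable, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def verifier_recall(relevant: Set[str], judge: Callable[[str], bool]) -> float:
    """Fraction of relevant documents the verifier accepts when shown them directly."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in relevant if judge(doc_id)) / len(relevant)


# Toy data (hypothetical): four relevant documents for one oblique query.
relevant = {"d1", "d2", "d3", "d4"}

# Hypothetical retriever ranking: only d1 lands in the top 3.
ranked = ["d9", "d1", "d7", "d8", "d6", "d2"]


def judge(doc_id: str) -> bool:
    """Stand-in for an LLM verifier; here it recognizes three of the four relevant documents."""
    return doc_id in {"d1", "d2", "d3"}


retrieval = recall_at_k(ranked, relevant, k=3)
verification = verifier_recall(relevant, judge)
gap = verification - retrieval  # a large positive gap means retrieval is the bottleneck

print(f"recall@3={retrieval:.2f} verifier_recall={verification:.2f} gap={gap:.2f}")
```

A large positive gap on real data would mean the verifier could confirm documents that the retriever never surfaces, which is the pattern the abstract describes.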