🤖 AI Summary
To address the retrieval bottleneck created by the exponential growth of scientific literature, this paper proposes CASPER, a concept-based sparse retrieval model. CASPER (1) matches queries and documents at both granular and conceptual levels by using tokens and keyphrases jointly as representation units; (2) mitigates the scarcity of suitable training data by automatically mining it from scholarly references, including titles, citation contexts, author-assigned keyphrases, and co-citations; and (3) supports both retrieval and keyphrase generation with a single sparse model. In experiments, CASPER outperforms strong dense and sparse baselines on eight scientific retrieval benchmarks. With only simple post-processing, it also generates keyphrases that are competitive in quality with those of CopyRNN and more diverse, at 3.8× the speed of CopyRNN.
📝 Abstract
The exponential growth of scientific literature has made it increasingly difficult for researchers to keep up with new work. To alleviate this problem, we propose CASPER, a sparse retrieval model for scientific search that uses tokens and keyphrases as representation units (i.e., dimensions in the sparse embedding space), enabling it to represent queries and documents with research concepts and match them at both granular and conceptual levels. To overcome the lack of suitable training data, we propose mining training data from scholarly references (i.e., signals that capture how the research concepts of papers are expressed in different settings), including titles, citation contexts, author-assigned keyphrases, and co-citations. CASPER outperforms strong dense and sparse retrieval baselines on eight scientific retrieval benchmarks. Moreover, we demonstrate that through simple post-processing, CASPER can be effectively used for the keyphrase generation task, achieving competitive performance with the established CopyRNN while producing more diverse keyphrases and being nearly four times faster.
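To make the idea of tokens and keyphrases as dimensions of a sparse space concrete, the following is a minimal sketch and not CASPER's actual implementation. It assumes a sparse vector is a plain dictionary from dimension name to weight, uses hypothetical `tok:` and `kp:` prefixes to tell the two kinds of dimension apart, and uses made-up weights where a trained model would supply learned ones. The `top_keyphrases` helper is an illustrative stand-in for the "simple post-processing" step, which the abstract does not specify.

```python
from typing import Dict, List

# Assumption: a sparse vector maps dimension names to weights.
# "tok:" marks token dimensions and "kp:" marks keyphrase dimensions;
# this naming scheme is hypothetical, chosen only for the example.
SparseVec = Dict[str, float]


def score(query: SparseVec, doc: SparseVec) -> float:
    """Relevance as a dot product over the dimensions both vectors share."""
    # Iterate over the smaller vector so the lookup count stays low.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(w * doc[dim] for dim, w in query.items() if dim in doc)


def top_keyphrases(doc: SparseVec, k: int = 3) -> List[str]:
    """Return the k highest-weighted keyphrase dimensions of a document."""
    phrases = [(dim[len("kp:"):], w) for dim, w in doc.items()
               if dim.startswith("kp:")]
    # Sort by descending weight, breaking ties alphabetically.
    phrases.sort(key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in phrases[:k]]


# Made-up weights for illustration only.
query = {"tok:sparse": 1.0, "kp:sparse retrieval": 2.0}
doc_a = {
    "tok:sparse": 0.5,
    "tok:retrieval": 0.4,
    "kp:sparse retrieval": 1.5,
    "kp:scientific search": 1.0,
}
doc_b = {"tok:dense": 0.8, "kp:dense retrieval": 1.2}

ranked = sorted(
    [("doc_a", doc_a), ("doc_b", doc_b)],
    key=lambda pair: score(query, pair[1]),
    reverse=True,
)
print([name for name, _ in ranked])
print(top_keyphrases(doc_a, k=2))
```

In this toy setup, `doc_a` matches the query on both a token dimension and a keyphrase dimension, while `doc_b` shares no dimension with it and scores zero. Reading off the strongest keyphrase dimensions of a document is one plausible way a sparse representation of this kind could double as a keyphrase generator.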