🤖 AI Summary
Current scientific paper retrieval predominantly relies on abstract-based modeling, which fails to capture the multi-dimensional semantics of full texts and therefore offers limited fine-grained matching capability. To address this, we propose a full-text, segment-aware, multi-perspective document retrieval framework. It decomposes query papers into semantic views (e.g., method, experiment, conclusion) and performs aspect-aligned, fine-grained matching against the corresponding sections of candidate papers. It further integrates dense retrieval with paragraph-level semantic modeling through a multi-perspective query optimization mechanism and an aspect-specific embedding strategy. We introduce SciFullBench, the first benchmark supporting full-text segment-level contrastive evaluation. On SciFullBench, our approach achieves an average 4.3% improvement in NDCG@10 over state-of-the-art baselines, significantly enhancing the discovery of deep cross-paper semantic associations.
📝 Abstract
Scientific paper retrieval, particularly when framed as document-to-document retrieval, aims to identify relevant papers in response to a long-form query paper rather than a short query string. Previous approaches to this task have focused on abstracts, embedding them into dense vectors as surrogates for full documents and calculating similarity across them, although abstracts provide only sparse and high-level summaries. To address this, we propose PRISM, a novel document-to-document retrieval method that introduces multiple, fine-grained representations for both the query and candidate papers. In particular, each query paper is decomposed into multiple aspect-specific views, each of which is embedded individually; these views are then matched against candidate papers that are similarly segmented, so that their multifaceted dimensions are taken into account. Moreover, we present SciFullBench, a novel benchmark in which the complete and segmented context of full papers is available for both queries and candidates. Experimental results show that PRISM improves performance by an average of 4.3% over existing retrieval baselines.
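The matching step described above can be illustrated with a minimal sketch. It is not PRISM's actual implementation: the abstract does not say how per-view similarities are combined, so the aggregation used here (best-matching segment per view, then the mean over views) is an assumption, as are the function names `score_candidate` and `rank_candidates`. The hand-written toy vectors stand in for embeddings that a dense encoder would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_candidate(query_views, candidate_segments):
    """Score one candidate paper against a multi-view query.

    query_views: dict mapping an aspect name to its embedding vector.
    candidate_segments: list of embedding vectors, one per segment.
    Assumed aggregation: best segment per view, averaged over views.
    """
    if not query_views:
        return 0.0
    per_view_best = [
        max((cosine(view_vec, seg) for seg in candidate_segments), default=0.0)
        for view_vec in query_views.values()
    ]
    return sum(per_view_best) / len(per_view_best)


def rank_candidates(query_views, corpus):
    """Return (paper_id, score) pairs sorted from most to least relevant."""
    scored = [
        (paper_id, score_candidate(query_views, segments))
        for paper_id, segments in corpus.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: two query views and two candidate papers with two segments each.
query_views = {
    "method": [1.0, 0.0, 0.0],
    "experiment": [0.0, 1.0, 0.0],
}
corpus = {
    "paper_a": [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],  # matches both views
    "paper_b": [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]],  # matches only "method"
}

ranking = rank_candidates(query_views, corpus)
for paper_id, score in ranking:
    print(f"{paper_id}: {score:.3f}")
```

In this toy setup, `paper_a` ranks above `paper_b` because it has a close segment for both query views, whereas `paper_b` matches only the method view. A single abstract-level vector per paper cannot express that distinction.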