🤖 AI Summary
Current systematic literature review (SLR) abstract screening methods lack fine-grained, zero-shot relevance ranking capabilities.
Method: We propose a tuning-free two-stage framework: (1) explicitly modeling the SLR research question and inclusion/exclusion criteria as structured prompts for large language models (LLMs) to enable fine-grained, interpretable relevance scoring; and (2) integrating dense re-ranking via a contrastive-learning-finetuned bi-encoder to mitigate the error propagation inherent in question-answering-based cascaded ranking.
Contribution/Results: Evaluated on a benchmark of 57 medical-domain SLRs, our approach achieves an average mean average precision (MAP) improvement of 5–10 percentage points over state-of-the-art QA-based ranking methods. The implementation code and annotated dataset are publicly released.
📄 Abstract
The scientific literature is growing rapidly, making it hard to keep track of the state of the art. Systematic literature reviews (SLRs) aim to identify and evaluate all relevant papers on a topic. After retrieving a set of candidate papers, the abstract screening phase determines initial relevance. To date, abstract screening methods using large language models (LLMs) have focused on binary classification settings; existing question answering (QA) based ranking approaches suffer from error propagation. LLMs offer a unique opportunity to evaluate the SLR's inclusion and exclusion criteria, yet existing benchmarks do not provide them exhaustively. We manually extract these criteria as well as research questions for 57 SLRs, mostly in the medical domain, enabling principled comparisons between approaches. Moreover, we propose LGAR, a zero-shot LLM Guided Abstract Ranker composed of an LLM-based graded relevance scorer and a dense re-ranker. Our extensive experiments show that LGAR outperforms existing QA-based methods by 5–10 pp. in mean average precision. Our code and data are publicly available.
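The two-stage pipeline described above can be sketched in outline: stage 1 prompts an LLM with the SLR's research question and inclusion/exclusion criteria to assign a graded relevance score per abstract, and stage 2 re-ranks by dense similarity so that abstracts with equal LLM grades are still finely ordered. This is a minimal illustrative sketch, not the paper's implementation: `llm_grade` and `dense_similarity` are hypothetical stand-ins for the actual LLM scorer and bi-encoder, and the prompt wording and 0-3 grade scale are assumptions.

```python
def build_prompt(research_question, inclusion, exclusion, abstract):
    """Assemble a structured prompt exposing the SLR criteria to the LLM.

    The field names and the 0-3 integer scale are illustrative assumptions,
    not the exact prompt format used in the paper.
    """
    return (
        f"Research question: {research_question}\n"
        f"Inclusion criteria: {'; '.join(inclusion)}\n"
        f"Exclusion criteria: {'; '.join(exclusion)}\n"
        f"Abstract: {abstract}\n"
        "Rate the relevance of this abstract on an integer scale from 0 to 3."
    )


def rank_abstracts(abstracts, llm_grade, dense_similarity):
    """Two-stage zero-shot ranking sketch.

    Stage 1: llm_grade(abstract) -> coarse graded relevance score.
    Stage 2: dense_similarity(abstract) -> score used to order abstracts
             that received the same LLM grade (dense re-ranking).
    Both callables are caller-supplied stubs in this sketch.
    """
    scored = [(llm_grade(a), dense_similarity(a), a) for a in abstracts]
    # Primary key: LLM grade; secondary key: dense similarity breaks ties.
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [a for _, _, a in scored]
```

With stub scorers, two abstracts sharing the top LLM grade are ordered by their dense similarity, which is the role the re-ranker plays in mitigating the coarseness of graded LLM scores.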