Reassessing Large Language Model Boolean Query Generation for Systematic Reviews

📅 2025-05-12

📈 Citations: 0

✨ Influential: 0

career value

185K/year

🤖 AI Summary

This work addresses inconsistent Boolean query generation in LLM-assisted systematic reviews—a critical bottleneck overlooked in prior research. We identify three previously neglected factors: (1) validation of query effectiveness, (2) strict output format constraints, and (3) strategic selection of chain-of-thought (CoT) examples. Through multi-model comparative experiments (e.g., ChatGPT), structured prompt engineering, and rigorous recall/precision evaluation, we empirically demonstrate— for the first time—that prompt design and model selection are the dominant determinants of query quality. We propose a guided generation paradigm leveraging high-quality seed literature, which substantially improves retrieval performance, achieving up to a 37% gain in recall. Furthermore, we establish a reproducible evaluation framework with standardized metrics and protocols. This work provides both methodological foundations and practical implementation pathways for automating systematic reviews in evidence-based medicine.

Technology Category

Application Category

📝 Abstract

Systematic reviews are comprehensive literature reviews that address highly focused research questions and represent the highest form of evidence in medicine. A critical step in this process is the development of complex Boolean queries to retrieve relevant literature. Given the difficulty of manually constructing these queries, recent efforts have explored Large Language Models (LLMs) to assist in their formulation. One of the first studies,Wang et al., investigated ChatGPT for this task, followed by Staudinger et al., which evaluated multiple LLMs in a reproducibility study. However, the latter overlooked several key aspects of the original work, including (i) validation of generated queries, (ii) output formatting constraints, and (iii) selection of examples for chain-of-thought (Guided) prompting. As a result, its findings diverged significantly from the original study. In this work, we systematically reproduce both studies while addressing these overlooked factors. Our results show that query effectiveness varies significantly across models and prompt designs, with guided query formulation benefiting from well-chosen seed studies. Overall, prompt design and model selection are key drivers of successful query formulation. Our findings provide a clearer understanding of LLMs' potential in Boolean query generation and highlight the importance of model- and prompt-specific optimisations. The complex nature of systematic reviews adds to challenges in both developing and reproducing methods but also highlights the importance of reproducibility studies in this domain.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for Boolean query generation in systematic reviews

Addressing reproducibility gaps in prior LLM-based query studies

Optimizing prompt design and model selection for effective queries

Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic reproduction of prior LLM Boolean query studies

Validation of generated queries with formatting constraints

Optimized prompt design and model selection strategies

🔎 Similar Papers

No similar papers found.