🤖 AI Summary
This study addresses the limitations of traditional keyword-based search in social science data discovery, which struggles with natural language expressions, spelling errors, geographic context, and complex queries. The authors develop and evaluate a large language model (LLM)-based semantic search system, conducting the first systematic comparison against the UK's Consumer Data Research Centre (CDRC) keyword tool using real user queries. Through multi-dimensional analysis of 131 high-frequency queries (BERT embedding cosine similarity, Jaccard index, exact match rate, and human evaluation), the results demonstrate that semantic search substantially outperforms keyword-based methods for place names, misspellings, niche topics, and intricate queries, yielding richer and more relevant results. Although the semantic tool does not return every dataset found by keyword search, the complementary strengths of the two approaches suggest potential for combined use to enhance overall data discovery effectiveness.
📝 Abstract
This paper evaluates the performance of a large language model (LLM)-based semantic search tool relative to a traditional keyword-based search for data discovery. Using real-world search behaviour, we compare outputs from a bespoke semantic search system applied to UKRI data services with the Consumer Data Research Centre (CDRC) keyword search. Analysis is based on 131 of the most frequently used search terms extracted from CDRC search logs between December 2023 and October 2024. We assess differences in the volume, overlap, ranking, and relevance of returned datasets using descriptive statistics, qualitative inspection, and quantitative similarity measures, including exact dataset overlap, Jaccard similarity, and cosine similarity derived from BERT embeddings. Results show that the semantic search consistently returns a larger number of results than the keyword search and performs particularly well for place-based, misspelled, obscure, or complex queries. While the semantic search does not capture all keyword-based results, the datasets returned are overwhelmingly semantically similar, with high cosine similarity scores despite lower exact overlap. Rankings of the most relevant results differ substantially between tools, reflecting contrasting prioritisation strategies. Case studies demonstrate that the LLM-based tool is robust to spelling errors, interprets geographic and contextual relevance effectively, and supports natural-language queries that keyword search fails to resolve. Overall, the findings suggest that LLM-driven semantic search offers a substantial improvement for data discovery, complementing rather than fully replacing traditional keyword-based approaches.
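To make the three similarity measures named in the abstract concrete, the following is a minimal sketch of how exact overlap, Jaccard similarity, and cosine similarity could be computed for one query. It is not the authors' implementation: the abstract does not state how result sets were represented or which BERT model produced the embeddings. The function names, the dataset identifiers, and the short toy vectors standing in for BERT embeddings are all hypothetical.

```python
import math


def exact_overlap(results_a, results_b):
    """Count the datasets returned by both tools (set intersection size)."""
    return len(set(results_a) & set(results_b))


def jaccard(results_a, results_b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 0.0  # assumed convention when both tools return nothing
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for a zero vector
    return dot / (norm_u * norm_v)


# Hypothetical result lists for a single query (IDs are illustrative only).
keyword_results = ["ds_a", "ds_b", "ds_c"]
semantic_results = ["ds_b", "ds_c", "ds_d", "ds_e", "ds_f"]

# Toy vectors standing in for BERT embeddings of each tool's results.
keyword_embedding = [1.0, 0.0, 1.0]
semantic_embedding = [1.0, 1.0, 1.0]

print("exact overlap:", exact_overlap(keyword_results, semantic_results))
print("jaccard:", round(jaccard(keyword_results, semantic_results), 3))
print("cosine:", round(cosine(keyword_embedding, semantic_embedding), 3))
```

The sketch also illustrates the pattern the abstract reports: because the semantic tool returns more datasets, the union grows and Jaccard similarity falls even when most keyword results are recovered, while cosine similarity over embeddings can remain high because it measures closeness of meaning rather than identical dataset IDs.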