🤖 AI Summary
Traditional keyword-based search struggles to uncover semantic trends and latent associations in large-scale biomedical literature. To address this, the authors propose ClusterTalk, a multidimensional interactive corpus exploration framework that integrates document clustering with faceted search. Its core idea is to build the clustering structure into the faceted search process, moving from one-dimensional retrieval to multidimensional exploration. Together with query understanding and interactive visualization, the framework supports navigation between corpus-level overviews and document-level details, along with interactive query refinement. Demonstrated on 4 million PubMed abstracts (2019–2022), the framework aims to improve information discoverability by encouraging deeper interaction with the corpus.
📝 Abstract
Exploratory search of large text corpora is essential in domains like biomedical research, where large amounts of research literature are continuously generated. This paper presents ClusterTalk, a framework for corpus exploration using multi-dimensional exploratory search. The system integrates document clustering with faceted search, allowing users to interactively refine their exploration and to ask both corpus-level and document-level queries. Compared to traditional one-dimensional search approaches such as keyword search or clustering alone, the system improves the discoverability of information by encouraging deeper interaction with the corpus. We demonstrate the functionality of the ClusterTalk framework on four million PubMed abstracts covering a four-year time frame. The demo video and source code are available at: https://github.com/achouhan93/ClusterTalk