ClusterTalk: Corpus Exploration Framework using Multi-Dimensional Exploratory Search

📅 2024-12-19

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional keyword-based search struggles to uncover semantic trends and latent associations in large-scale biomedical literature. To address this, the paper proposes an interactive, multi-dimensional corpus exploration framework that integrates document clustering with faceted search. Its core idea is to make the clustering structure part of the faceted search process, so that users move from one-dimensional retrieval to exploring several dimensions together. Combined with lexical and semantic search and an interactive visualization, the framework supports navigation between corpus-level overviews and document-level details, along with iterative query refinement. The system is demonstrated on four million PubMed abstracts (2019–2022) and aims to improve the discoverability of information by encouraging deeper interaction with the corpus.

📝 Abstract

Exploratory search of large text corpora is essential in domains like biomedical research, where large amounts of research literature are continuously generated. This paper presents ClusterTalk, a framework for corpus exploration using multi-dimensional exploratory search. Our system integrates document clustering with faceted search, allowing users to interactively refine their exploration and ask corpus-level and document-level queries. Compared to traditional one-dimensional search approaches like keyword search or clustering, this system improves the discoverability of information by encouraging a deeper interaction with the corpus. We demonstrate the functionality of the ClusterTalk framework based on four million PubMed abstracts spanning a four-year time frame. The demo video and source code are available at: https://github.com/achouhan93/ClusterTalk
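To make the combination of clustering and faceted search more concrete, the following is a minimal sketch in Python. It is not taken from the ClusterTalk code base: the `Doc` fields, the `explore` and `facet_counts` helpers, and the toy records are all hypothetical, chosen only to illustrate how a cluster label can act as one more dimension next to ordinary facets such as publication year.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # All fields are illustrative assumptions, not ClusterTalk's schema.
    doc_id: str
    cluster: str   # label assigned by some upstream clustering step
    year: int      # publication year, used as a timeline facet
    journal: str   # an ordinary metadata facet


def explore(docs, cluster=None, years=None, journal=None):
    """Keep the documents that match every dimension that is set."""
    selected = []
    for d in docs:
        if cluster is not None and d.cluster != cluster:
            continue
        if years is not None and not (years[0] <= d.year <= years[1]):
            continue
        if journal is not None and d.journal != journal:
            continue
        selected.append(d)
    return selected


def facet_counts(docs):
    """Count the values of each dimension within the current selection."""
    return {
        "cluster": Counter(d.cluster for d in docs),
        "year": Counter(d.year for d in docs),
        "journal": Counter(d.journal for d in docs),
    }


# Toy corpus standing in for a real collection of abstracts.
DOCS = [
    Doc("a1", "oncology", 2019, "J1"),
    Doc("a2", "oncology", 2021, "J2"),
    Doc("a3", "virology", 2020, "J1"),
    Doc("a4", "virology", 2021, "J1"),
]

# Narrow by cluster and by time range at the same time.
hits = explore(DOCS, cluster="virology", years=(2021, 2022))
print([d.doc_id for d in hits])  # ['a4']
```

Calling `facet_counts` on the result of each `explore` call gives the counts a user interface could show next to every filter, which is what lets a user refine the selection step by step rather than issuing a single query.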

Problem

Research questions and friction points this paper is trying to address.

Challenges in exploring large-scale text corpora in specialized domains

Limitations of traditional keyword-based search methods

Need for multi-feature search capabilities in corpus exploration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cluster-based organization with textual embeddings

Lexical and semantic search integration

Timeline-driven exploration and QA
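As an illustration of what integrating lexical and semantic search can look like, here is a minimal Python sketch. The page does not say how ClusterTalk combines the two signals, so the weighted sum below is an assumption, one common way of doing it. The function names, the `alpha` weight, and the two-dimensional toy vectors (standing in for real text embeddings) are hypothetical.

```python
import math


def lexical_score(query, text):
    """Fraction of query tokens that also appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_score(query, query_vec, text, text_vec, alpha=0.5):
    """Weighted sum of the lexical and semantic scores (assumed scheme)."""
    return (alpha * lexical_score(query, text)
            + (1 - alpha) * cosine(query_vec, text_vec))


# Toy documents: (text, embedding). Real embeddings would come from a model.
docs = [
    ("tumor growth in mice", [0.1, 0.9]),
    ("covid vaccine trial results", [0.9, 0.1]),
]
query, query_vec = "covid vaccine", [1.0, 0.0]

ranked = sorted(
    docs,
    key=lambda d: hybrid_score(query, query_vec, d[0], d[1]),
    reverse=True,
)
print(ranked[0][0])  # covid vaccine trial results
```

Setting `alpha=1.0` reduces this to pure keyword matching and `alpha=0.0` to pure embedding similarity, so the weight is the knob that trades one kind of evidence against the other.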
