AI Summary
Scientific document retrieval faces significant challenges due to the scarcity of domain-specific labeled data and the highly specialized nature of technical terminology, which often leads existing methods to suffer from conceptual redundancy or insufficient coverage. To address these limitations, this work proposes an academic concept indexing framework that integrates a structured scholarly taxonomy with large language models to extract and organize key concepts. The framework introduces two novel mechanisms: Concept-Coverage-aware Query Generation (CCQGen) and Concept-Focused Context Expansion (CCExpand), which jointly enhance the retrieval system's capacity to understand and match scientific semantics. Experimental results demonstrate that the proposed approach substantially improves query quality, concept alignment, and overall retrieval effectiveness, outperforming current state-of-the-art methods on scientific document retrieval benchmarks.
Abstract
Adapting general-domain retrievers to scientific domains is challenging due to the scarcity of large-scale domain-specific relevance annotations and the substantial mismatch in vocabulary and information needs. Recent approaches address these issues through two independent directions that leverage large language models (LLMs): (1) generating synthetic queries for fine-tuning, and (2) generating auxiliary contexts to support relevance matching. However, both directions overlook the diverse academic concepts embedded within scientific documents, often producing redundant or conceptually narrow queries and contexts. To address this limitation, we introduce an academic concept index, which extracts key concepts from papers and organizes them guided by an academic taxonomy. This structured index serves as a foundation for improving both directions. First, we enhance the synthetic query generation with concept coverage-based generation (CCQGen), which adaptively conditions LLMs on uncovered concepts to generate complementary queries with broader concept coverage. Second, we strengthen the context augmentation with concept-focused auxiliary contexts (CCExpand), which leverages a set of document snippets that serve as concise responses to the concept-aware CCQGen queries. Extensive experiments show that incorporating the academic concept index into both query generation and context augmentation leads to higher-quality queries, better conceptual alignment, and improved retrieval performance.
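The coverage-aware loop behind CCQGen can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`uncovered_concepts`, `generate_queries`, `template_generate`), the per-concept weights, and the substring test for "covered" are all assumptions made for illustration, and the LLM call is replaced by a simple template.

```python
from typing import Callable, Dict, List


def uncovered_concepts(doc_concepts: Dict[str, float],
                       queries: List[str],
                       k: int = 2) -> List[str]:
    """Return up to k concepts that no existing query mentions, highest weight first.

    Assumption: a concept counts as covered if its name appears as a
    substring of a query. A real system would likely use a learned matcher.
    """
    remaining = [
        c for c in doc_concepts
        if not any(c.lower() in q.lower() for q in queries)
    ]
    # Sort by descending weight, breaking ties alphabetically for determinism.
    remaining.sort(key=lambda c: (-doc_concepts[c], c))
    return remaining[:k]


def generate_queries(doc_concepts: Dict[str, float],
                     generate: Callable[[List[str]], str],
                     n_queries: int = 3,
                     k: int = 2) -> List[str]:
    """Generate queries one at a time, conditioning each on still-uncovered concepts."""
    queries: List[str] = []
    for _ in range(n_queries):
        targets = uncovered_concepts(doc_concepts, queries, k)
        if not targets:  # every concept is already covered
            break
        queries.append(generate(targets))
    return queries


def template_generate(targets: List[str]) -> str:
    """Hypothetical stand-in for an LLM prompted with the target concepts."""
    return "papers on " + " and ".join(targets)


# Hypothetical concept weights for a single paper.
concepts = {
    "dense retrieval": 0.9,
    "query generation": 0.8,
    "taxonomy": 0.5,
    "contrastive learning": 0.4,
}
queries = generate_queries(concepts, template_generate)
```

In this sketch each new query targets the concepts that earlier queries missed, so the query set widens its concept coverage instead of repeating the most salient topic. Under the same assumptions, the document snippets that answer these queries would play the role of the auxiliary contexts that CCExpand uses.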