Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora

📅 2024-01-26
🏛️ arXiv.org
📈 Citations: 8
Influential: 3
🤖 AI Summary
Domain-specific large language models (LLMs) suffer from scarcity of high-quality training data and prohibitively high manual curation costs. To address this, we propose Query of CC: an automated domain-knowledge mining framework powered by LLMs. It introduces a novel “query-driven + bootstrapped retrieval” paradigm, integrating seed query generation, semantic-augmented retrieval, cross-domain knowledge filtering, and reasoning-chain identification to efficiently extract multi-disciplinary knowledge—particularly structured reasoning processes—from massive public corpora such as Common Crawl. Leveraging this framework, we construct KNOWLEDGE PILE, the first open-source, multi-disciplinary, high-quality domain-knowledge dataset covering mathematics, physics, history, and philosophy. Experiments demonstrate substantial improvements in model performance on mathematical reasoning and general knowledge-intensive tasks. Both code and the KNOWLEDGE PILE dataset are fully open-sourced.

📝 Abstract
Large language models have demonstrated remarkable potential in various tasks, however, there remains a significant scarcity of open-source models and data for specific domains. Previous works have primarily focused on manually specifying resources and collecting high-quality data on specific domains, which significantly consume time and effort. To address this limitation, we propose an efficient data collection method $\textit{Query of CC}$ based on large language models. This method bootstraps seed information through a large language model and retrieves related data from public corpora. It not only collects knowledge-related data for specific domains but also unearths the data with potential reasoning procedures. Through the application of this method, we have curated a high-quality dataset called KNOWLEDGE PILE, encompassing four major domains, including STEM and humanities sciences, among others. Experimental results demonstrate that KNOWLEDGE PILE significantly improves the performance of large language models in mathematical and knowledge-related reasoning ability tests. To facilitate academic sharing, we open-source our dataset and code, providing valuable support to the academic community.
Problem

Research questions and friction points this paper is trying to address.

Lack of open-source models for specific domains
Manual data collection is time-consuming and labor-intensive
Need for efficient domain-specific knowledge retrieval from public corpora
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs to guide domain-specific data generation
Retrieving relevant data from Common Crawl
Curating the KNOWLEDGE PILE dataset spanning multiple domains
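The bullets above outline a query-driven, bootstrapped retrieval loop: an LLM seeds domain queries, matching documents are pulled from a public corpus, and strong hits are fed back as new queries. The sketch below is a minimal, self-contained illustration of that loop; all names (`generate_seed_queries`, `query_of_cc`), the query templates, the toy lexical-overlap scorer, and the threshold are illustrative assumptions, not the paper's actual implementation, which uses an LLM and full-scale retrieval over Common Crawl.

```python
def generate_seed_queries(domain):
    """Stand-in for the LLM that bootstraps seed queries for a domain
    (hypothetical templates; the paper generates these with an LLM)."""
    templates = ["what is {d}", "{d} proof", "history of {d}"]
    return [t.format(d=domain) for t in templates]

def score(query, document):
    """Toy word-overlap score; a real system would use BM25 or embeddings."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    return len(q & d) / max(len(q), 1)

def query_of_cc(domain, corpus, rounds=2, threshold=0.3, expand_top_k=1):
    """Retrieve documents for the domain, then bootstrap: the best new
    documents are added to the query pool for the next round."""
    queries = generate_seed_queries(domain)
    collected = []
    for _ in range(rounds):
        scored = [(doc, max(score(q, doc) for q in queries)) for doc in corpus]
        hits = [doc for doc, s in sorted(scored, key=lambda x: -x[1])
                if s >= threshold]
        for doc in hits:
            if doc not in collected:
                collected.append(doc)
        # bootstrap step: treat top-ranked hits as additional queries
        queries.extend(hits[:expand_top_k])
    return collected

corpus = [
    "what is a proof of the Pythagorean theorem",
    "celebrity gossip and movie reviews",
    "theorem proving techniques in mathematics",
]
print(query_of_cc("theorem", corpus))  # keeps both math pages, drops the gossip page
```

The bootstrap step is what distinguishes this from one-shot retrieval: documents retrieved in round one broaden the query pool, so round two can surface related material the seed queries alone would miss.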