🤖 AI Summary
Large language models (LLMs) face performance bottlenecks caused by knowledge scarcity in their pretraining corpora. Method: This paper proposes a gradient-free High-Knowledge Scorer (HKS) that evaluates data quality along explicit knowledge dimensions, namely knowledge density and knowledge coverage, measured against a multi-domain knowledge element pool. HKS supports both general-purpose and domain-specific selection of high-quality data without requiring model fine-tuning or gradient computation. Contribution/Results: HKS offers a knowledge-aware data quality assessment paradigm with cross-domain adaptability and domain-adaptive data selection. Models trained on HKS-filtered data achieve significant gains on knowledge-intensive tasks (e.g., factual QA, reasoning) and on general comprehension benchmarks. Moreover, domain-restricted variants of HKS yield measurable improvements in vertical domains, including medicine and law, enhancing domain-specific expertise and task accuracy.
📝 Abstract
The performance of Large Language Models (LLMs) is intrinsically linked to the quality of their training data. Although several studies have proposed methods for high-quality data selection, they overlook the importance of knowledge richness in text corpora. In this paper, we propose a novel, gradient-free High-Knowledge Scorer (HKS) that selects high-quality data along the dimension of knowledge, alleviating the problem of knowledge scarcity in the pretraining corpus. We construct a comprehensive multi-domain knowledge element pool and introduce knowledge density and knowledge coverage as metrics to assess the knowledge content of a text. Based on these metrics, we build a comprehensive knowledge scorer that selects knowledge-intensive data and, by restricting knowledge elements to a specific domain, also supports domain-specific high-knowledge data selection. We train models on a high-knowledge bilingual dataset, and experimental results demonstrate that our scorer improves model performance on knowledge-intensive and general comprehension tasks, enhancing both the generic and domain-specific capabilities of the model.
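To make the abstract's scoring idea concrete, here is a minimal sketch of how a density-plus-coverage scorer over a knowledge element pool could look. The pool structure, the exact density and coverage formulas, and the combination weight `alpha` are all assumptions for illustration; the paper's actual definitions may differ.

```python
def knowledge_score(tokens, knowledge_pool, alpha=0.5):
    """Score a document by knowledge density and coverage (hypothetical formulas).

    tokens: list of tokens in the document.
    knowledge_pool: dict mapping knowledge element -> domain label,
        standing in for the paper's multi-domain knowledge element pool.
    alpha: assumed weight balancing density against coverage.
    """
    # Density: fraction of tokens that are knowledge elements.
    hits = [t for t in tokens if t in knowledge_pool]
    density = len(hits) / max(len(tokens), 1)
    # Coverage: fraction of the pool's domains touched by the document.
    domains_hit = {knowledge_pool[t] for t in hits}
    all_domains = set(knowledge_pool.values())
    coverage = len(domains_hit) / max(len(all_domains), 1)
    return alpha * density + (1 - alpha) * coverage

def domain_score(tokens, knowledge_pool, domain):
    """Domain-specific variant: restrict the pool to one domain's elements,
    mirroring the abstract's 'restricting knowledge elements to the specific
    domain' for vertical-domain (e.g., medical or legal) data selection."""
    restricted = {k: d for k, d in knowledge_pool.items() if d == domain}
    return knowledge_score(tokens, restricted)
```

A selection pipeline would then rank pretraining documents by `knowledge_score` (or `domain_score` for a vertical corpus) and keep the top-scoring fraction; because the scorer is a lookup over a fixed pool, no gradients or model fine-tuning are needed, consistent with the gradient-free claim.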