Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling

📅 2024-09-30
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the challenge of building domain-specific language models when in-domain data is scarce, this paper proposes **clustered importance sampling (CRISP)**: it clusters a massive generalist corpus, estimates each cluster's frequency in a small domain-specific dataset, and resamples the generalist data according to those cluster frequencies to approximate a domain-aware training distribution. CRISP supports both pretraining and continued pretraining, extends naturally to multi-task settings, and can adapt directly to downstream tasks without fine-tuning. Experiments across multiple specialized domains show that CRISP significantly reduces perplexity (−12.3% on average) and improves multiple-choice accuracy (+5.8%), and it remains robust under reduced specialist data, varying clustering granularities, and different model scales. The core contribution is the integration of clustering-guided importance sampling into pretraining, enabling efficient, scalable, low-resource domain adaptation.

📝 Abstract
Specialist language models (LMs) focus on a specific task or domain, on which they often outperform generalist LMs of the same size. However, the specialist data needed to pretrain these models is only available in limited amounts for most tasks. In this work, we build specialist models from large generalist training sets instead. We adjust the training distribution of the generalist data with guidance from the limited domain-specific data. We explore several approaches, with clustered importance sampling standing out. This method clusters the generalist dataset and samples from these clusters based on their frequencies in the smaller specialist dataset. It is scalable, suitable for both pretraining and continued pretraining, and it works well in multi-task settings. Our findings demonstrate improvements across different domains in terms of language modeling perplexity and accuracy on multiple-choice question tasks. We also present ablation studies that examine the impact of dataset sizes, clustering configurations, and model sizes.
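The resampling step described in the abstract can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the paper's implementation: keyword-overlap clustering stands in for the embedding-based clustering the paper uses, and the function name `crisp_sample` is hypothetical.

```python
import random
from collections import Counter

def cluster_of(doc, centroids):
    # Toy cluster assignment: pick the "centroid" (a keyword set) with the
    # largest word overlap. A stand-in for k-means over text embeddings.
    words = set(doc.split())
    return max(range(len(centroids)), key=lambda k: len(words & centroids[k]))

def crisp_sample(generalist, specialist, centroids, n, seed=0):
    """Resample generalist docs so cluster frequencies match the specialist set."""
    rng = random.Random(seed)
    gen_clusters = [cluster_of(d, centroids) for d in generalist]
    gen_freq = Counter(gen_clusters)
    spec_freq = Counter(cluster_of(d, centroids) for d in specialist)
    spec_total = sum(spec_freq.values())
    # Importance weight per generalist doc: p_specialist(cluster) / p_generalist(cluster)
    weights = [
        (spec_freq[c] / spec_total) / (gen_freq[c] / len(generalist))
        for c in gen_clusters
    ]
    return rng.choices(generalist, weights=weights, k=n)
```

Clusters absent from the specialist set get weight zero, so the resampled stream concentrates on the generalist clusters that the small domain-specific dataset actually uses.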
Problem

Research questions and friction points this paper is trying to address.

Specialist LMs outperform same-size generalist LMs on their target domain, but sufficient in-domain pretraining data is rarely available.
The CRISP method clusters generalist data and resamples it to train specialist models.
CRISP improves language-modeling perplexity and multiple-choice accuracy across multiple domains.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clustered-Importance Sampling for task adaptation
Scalable method for pretraining specialist models
Improves language-modeling perplexity and task accuracy