Domain Pre-training Impact on Representations

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the impact of the pretraining corpus on the representational quality of Transformer models, focusing exclusively on representations induced during the pretraining phase itself. We pretrain with a contrastive learning objective across diverse, multi-source corpora, including small-scale domain-specific datasets, and evaluate the resulting representations using standardized protocols such as linear probing and similarity analysis. Our findings are threefold: (1) Domain adaptability exerts a stronger influence on representation quality than corpus size; high-quality, domain-specific data, even in limited quantity, suffices to induce robust, generalizable representations. (2) Distributional similarity between target tasks and pretraining corpora serves as a critical success criterion for mixed-corpus pretraining. (3) Under distribution-matching conditions, mixing domain-specific and general corpora significantly enhances downstream task performance. Collectively, these results provide both theoretical grounding and practical guidelines for efficient, low-resource pretraining strategies.
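The summary mentions linear probing as an evaluation protocol. As a rough illustration only (not the authors' exact setup), a linear probe trains a simple classifier on top of frozen pretrained representations; the `encoder` object and its `embed` method below are placeholders, not a real library API.

```python
# Minimal linear-probing sketch (illustrative; not the paper's exact protocol).
# Assumes `encoder` is a frozen, pretrained model with a hypothetical
# `embed(text)` method returning a fixed-size vector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def encode(encoder, texts):
    # Frozen features: the encoder is never updated during probing.
    return np.stack([encoder.embed(t) for t in texts])

def linear_probe(encoder, train_texts, train_labels, test_texts, test_labels):
    # Only the linear classifier on top of the frozen representations is trained,
    # so test accuracy reflects the quality of the pretrained representations.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(encode(encoder, train_texts), train_labels)
    preds = clf.predict(encode(encoder, test_texts))
    return accuracy_score(test_labels, preds)
```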

📝 Abstract
This empirical study analyzes the effects of the pre-training corpus on the quality of learned transformer representations. We focus on the representation quality induced solely through pre-training. Our experiments show that pre-training on a small, specialized corpus can yield effective representations, and that the success of combining a generic and a specialized corpus depends on the distributional similarity between the target task and the specialized corpus.
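The abstract does not state how distributional similarity between the target task and a corpus is measured. One common proxy, used here purely as an assumed stand-in, is the Jensen-Shannon divergence between unigram distributions of the two text collections.

```python
# Illustrative corpus-similarity proxy: Jensen-Shannon divergence between unigram
# distributions of two text collections. This is an assumption, not the paper's
# stated measure of distributional similarity.
from collections import Counter
import math

def unigram_dist(texts):
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    # JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), where m is the mixture of p and q.
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a):
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Lower divergence suggests the target task is closer to the pretraining corpus,
# the regime in which combining corpora is reported to help.
task = unigram_dist(["find the dosage of the prescribed drug"])
corpus = unigram_dist(["the prescribed drug dosage depends on body weight"])
print(js_divergence(task, corpus))
```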
Problem

Research questions and friction points this paper is trying to address.

Effects of pre-training corpus on transformer representations
Representation quality from pre-training alone
Impact of corpus similarity on combined pre-training success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-training on small specialized corpus
Combining generic and specialized corpora
Focusing on distributional similarity
Cesar Gonzalez-Gutierrez
Universitat Politècnica de Catalunya, Barcelona, Spain
Ariadna Quattoni
dMetrics, USA
machine learning · computer vision · natural language processing