Domain Pre-training Impact on Representations

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the impact of the pretraining corpus on the representational quality of Transformer models, focusing exclusively on representations induced during the pretraining phase itself. We pretrain with a contrastive learning objective across diverse, multi-source corpora, including small-scale domain-specific datasets, and evaluate the resulting representations using standardized protocols such as linear probing and similarity analysis. Our findings are threefold: (1) Domain adaptability exerts a stronger influence on representation quality than corpus size; high-quality, domain-specific data, even in limited quantity, suffices to induce robust, generalizable representations. (2) Distributional similarity between target tasks and pretraining corpora serves as a critical success criterion for mixed-corpus pretraining. (3) Under distribution-matching conditions, mixing domain-specific and general corpora significantly enhances downstream task performance. Collectively, these results provide both theoretical grounding and practical guidelines for efficient, low-resource pretraining strategies.
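The summary mentions linear probing as an evaluation protocol. As a rough illustration only (not the authors' exact setup), a linear probe trains a simple classifier on top of frozen pretrained representations; the `encoder` object and its `embed` method below are placeholders, not a real library API.

```python
# Minimal linear-probing sketch (illustrative; not the paper's exact protocol).
# Assumes `encoder` is a frozen, pretrained model with a hypothetical
# `embed(text)` method returning a fixed-size vector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def encode(encoder, texts):
    # Frozen features: the encoder is never updated during probing.
    return np.stack([encoder.embed(t) for t in texts])

def linear_probe(encoder, train_texts, train_labels, test_texts, test_labels):
    # Only the linear classifier on top of the frozen representations is trained,
    # so test accuracy reflects the quality of the pretrained representations.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(encode(encoder, train_texts), train_labels)
    preds = clf.predict(encode(encoder, test_texts))
    return accuracy_score(test_labels, preds)
```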

📝 Abstract
This empirical study analyzes the effects of the pre-training corpus on the quality of learned transformer representations. We focus on the representation quality induced solely through pre-training. Our experiments show that pre-training on a small, specialized corpus can yield effective representations, and that the success of combining a generic and a specialized corpus depends on the distributional similarity between the target task and the specialized corpus.
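The abstract does not state how distributional similarity between the target task and a corpus is measured. One common proxy, used here purely as an assumed stand-in, is the Jensen-Shannon divergence between unigram distributions of the two text collections.

```python
# Illustrative corpus-similarity proxy: Jensen-Shannon divergence between unigram
# distributions of two text collections. This is an assumption, not the paper's
# stated measure of distributional similarity.
from collections import Counter
import math

def unigram_dist(texts):
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    # JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), where m is the mixture of p and q.
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a):
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Lower divergence suggests the target task is closer to the pretraining corpus,
# the regime in which combining corpora is reported to help.
task = unigram_dist(["find the dosage of the prescribed drug"])
corpus = unigram_dist(["the prescribed drug dosage depends on body weight"])
print(js_divergence(task, corpus))
```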
Problem

Research questions and friction points this paper is trying to address.

Effects of pre-training corpus on transformer representations
Representation quality from pre-training alone
Impact of corpus similarity on combined pre-training success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-training on small specialized corpus
Combining generic and specialized corpora
Focusing on distributional similarity
Cesar Gonzalez-Gutierrez
Universitat Politècnica de Catalunya, Barcelona, Spain
Ariadna Quattoni
dMetrics, USA
machine learning · computer vision · natural language processing