DEPT: Decoupled Embeddings for Pre-training Language Models

📅 2024-10-07
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Multi-source heterogeneous text pre-training often suffers from cross-lingual and cross-domain negative interference (the “curse of multilinguality”) and incurs prohibitive communication and memory overhead. Method: We propose the first vocabulary-agnostic federated pre-training framework for large language models: it decouples token embeddings from the Transformer backbone, assigns each data source an independent vocabulary trained in parallel, and allocates and loads embedding parameters on demand. Contribution/Results: This design reduces communication costs by orders of magnitude and cuts embedding memory consumption by 4–5×, while improving backbone generalization and training robustness. Experiments show up to a 20% improvement in average perplexity and significant gains across downstream tasks.
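A back-of-the-envelope sketch of the communication claim above: cost per round scales with the number of synchronized parameters, so keeping embeddings local shrinks traffic in proportion to the embedding share (the paper's orders-of-magnitude figure also reflects reduced synchronization frequency). All numbers below are illustrative assumptions, not taken from the paper.

```python
# Illustrative arithmetic (assumed sizes, not the paper's): how much
# communication is avoided when per-source embeddings stay local and
# only the shared backbone is synchronized between workers.
d_model = 2048              # hidden width (assumed)
vocab_union = 250_000       # union multilingual vocabulary (assumed)
backbone_params = 1.0e9     # ~1B backbone parameters (assumed)
embed_params = vocab_union * d_model  # embedding-table parameters

full_sync = backbone_params + embed_params  # baseline: sync everything
dept_sync = backbone_params                 # decoupled: embeddings stay local

rounds = 1000
# Bytes saved over all rounds at fp32 (4 bytes/param), in GB.
saved_gb = rounds * embed_params * 4 / 1e9
print(f"per-round reduction: {full_sync / dept_sync:.2f}x, "
      f"~{saved_gb:.0f} GB not communicated over {rounds} rounds")
```

The per-round ratio alone is modest; the large end-to-end savings come from compounding it over many rounds and syncing less often.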

๐Ÿ“ Abstract
Language Model pre-training uses broad data mixtures to enhance performance across domains and languages. However, training on such heterogeneous text corpora requires extensive and costly effort. Since these data sources vary significantly in lexical, syntactic, and semantic aspects, they cause negative interference or the “curse of multilinguality”. To address these challenges, we propose a communication-efficient pre-training framework, DEPT. Our method decouples embeddings from the transformer body while simultaneously training the latter on multiple data sources without requiring a shared vocabulary. DEPT can: (1) train robustly and effectively under significant data heterogeneity, (2) minimize token embedding parameters to only what the data source vocabulary requires, while cutting communication costs in direct proportion to both the communication frequency and the reduction in parameters, (3) enhance transformer body plasticity and generalization, improving both average perplexity (up to 20%) and downstream task performance, and (4) enable training with custom optimized vocabularies per data source. We demonstrate DEPT's potential via the first vocabulary-agnostic federated pre-training of billion-scale models, reducing communication costs by orders of magnitude and embedding memory by 4–5×.
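The core decoupling described in the abstract can be sketched in a few lines: each data source keeps an embedding table sized to its own vocabulary, while only the shared backbone is common to all sources. This is a minimal illustration under assumed toy sizes, not the authors' implementation; names like `forward` and the vocabulary sizes are invented for the example.

```python
# Minimal sketch (not the paper's code) of decoupled, per-source
# embeddings with a shared backbone, using toy sizes.
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding width (illustrative)

# Per-source vocabularies; a shared vocabulary would need their union.
vocab_sizes = {"web_en": 3000, "code": 1200, "web_de": 2500}

# Decoupled design: each source holds only its own embedding table.
embeddings = {s: rng.normal(size=(n, d)) for s, n in vocab_sizes.items()}

# Shared backbone: the only parameters that all sources train jointly
# (and, in the federated setting, the only ones communicated).
backbone = rng.normal(size=(d, d))

def forward(source, token_ids):
    """Look up source-local embeddings, then apply the shared backbone."""
    x = embeddings[source][token_ids]  # (seq, d), source-specific
    return x @ backbone                # (seq, d), shared weights

h = forward("code", np.array([0, 5, 10]))
assert h.shape == (3, d)

# Memory: a worker stores one small per-source table instead of a
# union-vocabulary table covering every source.
union_table = sum(vocab_sizes.values()) * d   # union-vocab embedding params
local_table = max(vocab_sizes.values()) * d   # largest per-source table
print(union_table / local_table)              # > 1: per-source tables are smaller
```

Token ids are local to each source's vocabulary, which is what lets every source use a custom, optimally sized tokenizer without coordinating on a shared one.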
Problem

Research questions and friction points this paper is trying to address.

Addresses negative interference in heterogeneous text corpora training
Reduces communication costs and embedding memory usage
Enhances transformer plasticity and generalization for better performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples embeddings from transformer body
Minimizes token embedding parameters efficiently
Enables custom vocabularies per data source
Alexandru Iacob
Department of Computer Science and Technology, University of Cambridge, Flower Labs
Lorenzo Sani
PhD student in Computer Science, University of Cambridge
federated learning, machine learning, large-scale ML, mobile systems, physics
Meghdad Kurmanji
Postdoctoral Researcher - University of Cambridge
Machine Learning, Machine Unlearning, Federated Learning
William F. Shen
University of Cambridge
AI, ML
Xinchi Qiu
Meta, University of Cambridge
GenAI, Privacy-preserving ML, AI Robustness, ML Systems
Dongqi Cai
Department of Computer Science and Technology, University of Cambridge, Beijing University of Posts and Telecommunications
Yan Gao
Department of Computer Science and Technology, University of Cambridge
N. D. Lane
Department of Computer Science and Technology, University of Cambridge, Flower Labs