Upsample or Upweight? Balanced Training on Heavily Imbalanced Datasets

📅 2024-10-06

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

career value

167K/year

🤖 AI Summary

Low-resource languages in multilingual long-tail data suffer from poor performance due to severe data imbalance. Method: This paper systematically analyzes the equivalence conditions between upsampling and loss reweighting under SGD optimization, and—contrary to common assumptions—rigorously proves their fundamental inequivalence arising from gradient variance disparities. We propose Cooldown, a progressive downsampling strategy that dynamically adjusts sampling temperature and loss weights to jointly optimize convergence speed, generalization, and memory efficiency. Contribution/Results: Theoretical analysis and extensive experiments on multilingual benchmarks demonstrate that Cooldown significantly mitigates overfitting for low-resource languages, matches state-of-the-art reweighting methods in accuracy, accelerates training by 23%, and reduces GPU memory consumption by 18%. Cooldown establishes a new paradigm for long-tail multilingual modeling that is interpretable, computationally efficient, and lightweight.

Technology Category

Application Category

📝 Abstract

Data abundance across different domains exhibits a long-tailed distribution: few domains have abundant data, while most face data scarcity. Our work focuses on a multilingual setting, where available data is heavily skewed towards high-resource languages. Two common strategies to address this disparity are upsampling low-resource data (Temperature Sampling) and upweighting low-resource loss (Scalarization). These methods are often assumed to be equivalent, but this equivalence has not been rigorously established, prompting our investigation. Through theoretical and empirical analysis, we identify when these two methods are equivalent and when they diverge. We prove that they are equivalent under full gradient descent but differ under stochastic gradient descent due to differences in gradient variance. Specifically, Temperature Sampling exhibits lower variance in gradient estimation compared to Scalarization, leading to faster convergence but a higher risk of overfitting. Based on these insights, we propose Cooldown, a strategy that starts by heavily upsampling low-resource languages to accelerate convergence and gradually reduces the upsampling to prevent overfitting -- achieving the best of both worlds. Our method competes effectively with existing data re-weighting techniques while offering computational efficiency.

Problem

Research questions and friction points this paper is trying to address.

Addressing data imbalance in multilingual datasets

Comparing upsampling and upweighting strategies for low-resource languages

Proposing Cooldown method to balance convergence and overfitting

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temperature Sampling for low-resource data upsampling

Scalarization for low-resource loss upweighting

Cooldown strategy balances convergence and overfitting

🔎 Similar Papers

No similar papers found.