🤖 AI Summary
Low-resource languages in multilingual long-tail data suffer from poor performance due to severe data imbalance. Method: This paper systematically analyzes the equivalence conditions between upsampling and loss reweighting under SGD optimization, and—contrary to common assumptions—rigorously proves their fundamental inequivalence arising from gradient variance disparities. We propose Cooldown, a progressive downsampling strategy that dynamically adjusts sampling temperature and loss weights to jointly optimize convergence speed, generalization, and memory efficiency. Contribution/Results: Theoretical analysis and extensive experiments on multilingual benchmarks demonstrate that Cooldown significantly mitigates overfitting for low-resource languages, matches state-of-the-art reweighting methods in accuracy, accelerates training by 23%, and reduces GPU memory consumption by 18%. Cooldown establishes a new paradigm for long-tail multilingual modeling that is interpretable, computationally efficient, and lightweight.
📝 Abstract
Data abundance across different domains exhibits a long-tailed distribution: few domains have abundant data, while most face data scarcity. Our work focuses on a multilingual setting, where available data is heavily skewed towards high-resource languages. Two common strategies to address this disparity are upsampling low-resource data (Temperature Sampling) and upweighting low-resource loss (Scalarization). These methods are often assumed to be equivalent, but this equivalence has not been rigorously established, prompting our investigation. Through theoretical and empirical analysis, we identify when these two methods are equivalent and when they diverge. We prove that they are equivalent under full gradient descent but differ under stochastic gradient descent due to differences in gradient variance. Specifically, Temperature Sampling exhibits lower variance in gradient estimation compared to Scalarization, leading to faster convergence but a higher risk of overfitting. Based on these insights, we propose Cooldown, a strategy that starts by heavily upsampling low-resource languages to accelerate convergence and gradually reduces the upsampling to prevent overfitting -- achieving the best of both worlds. Our method competes effectively with existing data re-weighting techniques while offering computational efficiency.