🤖 AI Summary
Motivated by concerns that the supply of textual training data for large language models (LLMs) may soon be exhausted, this paper investigates contrastive decoding as a method for generating synthetic training corpora. Rather than sampling from a single model, the approach samples text by amplifying the relative difference between a good and a bad model trained on the same original 100-million-word corpus, favoring continuations on which the better model most clearly outperforms the weaker one. The resulting synthetic corpus is mixed with the real training data; no human annotation or hand-crafted rules are required. In a controlled setting, models trained on this mixture improve on the language modeling objective and a range of downstream tasks. Synthetic data from contrastive decoding proves especially beneficial for tasks that require reasoning skills, whereas synthetic data from conventional sampling methods (e.g., top-k or nucleus sampling) helps more on tasks that depend on surface-level linguistic capabilities.
📝 Abstract
Large language models (LLMs) are trained on vast amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and a bad model trained on the same original corpus of 100 million words. By amplifying the signal from the model with better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface-level linguistic capabilities.
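The sampling scheme the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the `alpha` plausibility cutoff and `beta` contrast strength follow the general contrastive-decoding recipe of Li et al. (2022), and the toy log-probability vectors standing in for the good and bad models are assumptions for demonstration.

```python
import numpy as np

def contrastive_sample(logp_good, logp_bad, alpha=0.1, beta=1.0, rng=None):
    """Sample one token by amplifying the gap between a good and a bad model.

    logp_good, logp_bad: log-probabilities over the vocabulary for the
        next token, from the stronger and weaker model respectively.
    alpha: plausibility cutoff relative to the good model's top token
        (tokens the good model itself finds unlikely are excluded).
    beta: strength of the penalty for tokens the bad model also likes.
    """
    rng = rng or np.random.default_rng()
    # Contrastive score: reward tokens where the good model's confidence
    # exceeds the bad model's, i.e. where the "signal" is strongest.
    scores = logp_good - beta * logp_bad
    # Plausibility constraint: mask tokens far below the good model's best.
    cutoff = np.log(alpha) + logp_good.max()
    scores = np.where(logp_good < cutoff, -np.inf, scores)
    # Renormalize the surviving scores into a distribution and sample.
    probs = np.exp(scores - scores[np.isfinite(scores)].max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy 5-token vocabulary: the models agree on token 0 but the good model
# is more decisive, so contrast shifts mass toward its preferences.
good = np.log(np.array([0.50, 0.20, 0.15, 0.10, 0.05]))
bad = np.log(np.array([0.30, 0.30, 0.20, 0.10, 0.10]))
token = contrastive_sample(good, bad, rng=np.random.default_rng(0))
```

Repeating this step autoregressively over a prompt set would yield the synthetic corpus that is then mixed with the real training data.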