🤖 AI Summary
This work studies how models can efficiently acquire low-frequency compositional reasoning skills when training data follows a long-tailed power-law distribution. The authors compare power-law and uniform data sampling through a theoretical sample-complexity analysis and a characterization of the loss landscape. They show that power-law sampling introduces a beneficial asymmetry: models first master high-frequency skill compositions and then use them as a stepping stone to learn rare ones. Empirically, on tasks such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. Theoretically, on a minimalist skill-composition task, learning under a power-law distribution provably requires significantly less training data.
📝 Abstract
Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power-law sampling induces a beneficial asymmetry that improves the pathological loss landscape. This asymmetry enables models to first acquire high-frequency skill compositions with low data complexity, and these in turn serve as a stepping stone to efficiently learn rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
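To make the two sampling regimes concrete, the following minimal Python sketch draws skill indices from a Zipf-style power-law distribution and from a uniform distribution over the same skills. It is an illustration only, not the paper's task or code: the function names, the number of skills, and the exponent `alpha` are all assumptions chosen for this example.

```python
import random
from collections import Counter


def power_law_weights(num_skills, alpha=1.0):
    """Normalized Zipf-style weights: p(k) is proportional to 1 / k**alpha
    for rank k = 1..num_skills. `alpha` is an assumed exponent."""
    raw = [1.0 / (k ** alpha) for k in range(1, num_skills + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def uniform_weights(num_skills):
    """Equal probability for every skill."""
    return [1.0 / num_skills] * num_skills


def sample_skills(weights, num_samples, seed=0):
    """Draw skill indices (0-based) i.i.d. according to `weights`."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


if __name__ == "__main__":
    num_skills = 8  # hypothetical number of skills
    for name, weights in [
        ("power-law", power_law_weights(num_skills, alpha=1.0)),
        ("uniform", uniform_weights(num_skills)),
    ]:
        counts = Counter(sample_skills(weights, num_samples=10_000))
        # Print how often each skill index appears in the sampled training set.
        print(name, [counts[i] for i in range(num_skills)])
```

Under the power-law weights the first few skills dominate the sample while the tail is seen rarely, whereas the uniform weights spread examples evenly. The abstract's claim concerns which of these two training distributions lets a model learn the rare skills with less data.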