The Power of Power Law: Asymmetry Enables Compositional Reasoning

📅 2026-04-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how models can efficiently acquire low-frequency compositional reasoning skills when training data follows a long-tailed power-law distribution. Focusing on compositional reasoning tasks, the authors compare power-law and uniform data sampling through theoretical sample-complexity analysis and loss-landscape characterization. They show that power-law sampling introduces a beneficial asymmetry: models first master high-frequency skill compositions, then use them as a stepping stone to learn rare ones. Empirically, on tasks such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. Theoretically, on a minimalist skill-composition task, learning under a power-law distribution provably requires significantly less training data.

📝 Abstract
Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power-law sampling induces a beneficial asymmetry that improves the pathological loss landscape. This asymmetry enables models to first acquire high-frequency skill compositions with low data complexity; these compositions in turn serve as a stepping stone to efficiently learn rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
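
To make the two sampling regimes concrete, here is a minimal sketch of drawing skill indices under a uniform versus a power-law (Zipf-style) distribution. It is an illustration only, not the authors' setup: the function names, the exponent `alpha`, the number of skills, and the sample counts are all assumptions, since this page does not state the distributions or parameters used in the paper.

```python
import random
from collections import Counter


def power_law_weights(num_skills, alpha=1.0):
    """Unnormalized Zipf-style weights: the skill at rank r gets weight r ** -alpha.

    `alpha` is a hypothetical exponent chosen for illustration.
    """
    return [rank ** -alpha for rank in range(1, num_skills + 1)]


def sample_skills(num_skills, num_samples, alpha=None, seed=0):
    """Draw skill indices; alpha=None means uniform, otherwise power-law."""
    rng = random.Random(seed)
    if alpha is None:
        weights = [1.0] * num_skills
    else:
        weights = power_law_weights(num_skills, alpha)
    # random.choices normalizes the weights internally.
    return rng.choices(range(num_skills), weights=weights, k=num_samples)


# Hypothetical sizes: 10 skills, 10,000 training draws per regime.
uniform_counts = Counter(sample_skills(10, 10_000))
power_counts = Counter(sample_skills(10, 10_000, alpha=1.5))

print("uniform  :", [uniform_counts[i] for i in range(10)])
print("power law:", [power_counts[i] for i in range(10)])
```

Under the power-law draw, the first few skills dominate the sample while the tail is seen only rarely, which is the asymmetry the abstract refers to. Under the uniform draw every skill is seen about equally often.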
Problem

Research questions and friction points this paper is trying to address.

power-law distribution
compositional reasoning
long-tail skills
data distribution
asymmetry
Innovation

Methods, ideas, or system contributions that make the work stand out.

power-law distribution
compositional reasoning
data asymmetry
long-tail learning
loss landscape