The Power of Power Law: Asymmetry Enables Compositional Reasoning

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies how models can efficiently acquire low-frequency compositional reasoning skills when the training data follows a long-tailed power-law distribution. The authors compare power-law and uniform data sampling through a theoretical sample-complexity analysis and a characterization of the loss landscape. They show that power-law sampling introduces a beneficial asymmetry: models first master high-frequency skills and then transfer that knowledge to rare compositions. Empirically, on tasks such as state tracking and multi-step arithmetic, power-law training consistently outperforms uniform sampling. Theoretically, on a minimalist skill-composition task, learning under a power-law distribution provably requires significantly less training data.

📝 Abstract

Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power-law sampling induces a beneficial asymmetry that improves the pathological loss landscape. This asymmetry enables models to first acquire high-frequency skill compositions with low data complexity, and these compositions in turn serve as a stepping stone for efficiently learning rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
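
To make the contrast between the two sampling regimes concrete, here is a minimal Python sketch. It is not the authors' experimental setup: the number of skills (100), the Zipf exponent (1.0), the sample size, and all function names are assumptions chosen only for illustration. It shows how a power-law sampler concentrates training examples on a few high-frequency skills, whereas a uniform sampler spreads them evenly.

```python
import random
from collections import Counter


def power_law_weights(num_skills, exponent=1.0):
    """Normalized Zipf-style weights: p(rank) is proportional to 1 / rank**exponent."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, num_skills + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def uniform_weights(num_skills):
    """Every skill is equally likely."""
    return [1.0 / num_skills] * num_skills


def sample_skills(weights, num_samples, seed=0):
    """Draw skill indices (0 = most frequent) and count how often each appears."""
    rng = random.Random(seed)
    skills = list(range(len(weights)))
    return Counter(rng.choices(skills, weights=weights, k=num_samples))


# Hypothetical settings, not taken from the paper.
num_skills = 100
zipf = power_law_weights(num_skills, exponent=1.0)
flat = uniform_weights(num_skills)

zipf_counts = sample_skills(zipf, 10_000)
flat_counts = sample_skills(flat, 10_000)

# Probability mass assigned to the 10 most frequent skills under each regime.
head_share_zipf = sum(zipf[:10])
head_share_flat = sum(flat[:10])

print(f"top-10 share, power law: {head_share_zipf:.3f}")
print(f"top-10 share, uniform:   {head_share_flat:.3f}")
print("most sampled skills (power law):", zipf_counts.most_common(3))
```

With these assumed settings, the ten most frequent skills receive more than half of the probability mass under the power law, compared with a tenth under uniform sampling. This is the kind of asymmetry the abstract refers to; whether it helps learning depends on the analysis in the paper, which this sketch does not reproduce.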

Problem

Research questions and friction points this paper is trying to address.

power-law distribution

compositional reasoning

long-tail skills

data distribution

asymmetry

Innovation

Methods, ideas, or system contributions that make the work stand out.

power-law distribution

compositional reasoning

data asymmetry