How Do Large Language Monkeys Get Their Power (Laws)?

📅 2025-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work resolves an apparent paradox in large language models (LLMs) evaluated under multi-task, multi-attempt settings: on each individual task the failure rate decays exponentially with the number of attempts, yet the aggregate success probability follows a power law. We propose a unified explanation grounded in the heavy-tailed distribution of per-task single-attempt success probabilities—the "heavy-tail hypothesis"—and develop a statistical model based on this premise. Our method integrates heavy-tailed distribution modeling, empirical validation of the scaling laws, and analysis of inference compute. The resulting framework predicts the power-law exponent with an order of magnitude lower relative error, or, equivalently, achieves the same forecast accuracy with 2–4 orders of magnitude less inference compute. Beyond explaining previously observed deviations from power-law scaling, this work establishes a theoretical foundation for understanding LLM inference scaling behavior and delivers a practical predictive tool for system design and resource allocation.

📝 Abstract
Recent research across mathematical problem solving, proof assistant programming, and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales as a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating that per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed, such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law -- even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling, and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, ~2–4 orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute, and to the development of scaling-predictable evaluations of (multimodal) language models.
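The mechanism described in the abstract can be checked numerically. In the sketch below (an illustration, not the authors' code or data), per-task single-attempt success probabilities are drawn from a Beta(a, 1) distribution, whose CDF P(p ≤ x) = x^a is heavy-tailed near zero for small a. Every simulated task fails exponentially in the number of attempts k, yet the log-log slope of the aggregate failure rate recovers the assumed tail exponent a, i.e., the aggregate trend is a power law k^{-a}:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (an assumption, not the paper's data): draw per-task
# single-attempt success probabilities from Beta(a, 1), whose CDF
# P(p <= x) = x^a is heavy-tailed near zero for small a.
a = 0.3                                  # assumed tail exponent
p = rng.beta(a, 1.0, size=200_000)       # one success probability per task

ks = np.unique(np.logspace(2, 4, 20).astype(int))  # attempts k in [100, 10000]

# Each task individually fails exponentially in k: (1 - p_i)^k ...
agg_fail = np.array([np.mean((1.0 - p) ** k) for k in ks])

# ... yet averaging over tasks yields only polynomial decay, and
# -log(average success) ~ average failure once the latter is small.
slope, _ = np.polyfit(np.log(ks), np.log(agg_fail), 1)
print(f"fitted log-log slope: {slope:.2f} (tail exponent a = {a})")
```

The fitted slope comes out close to -a: the rare near-zero success probabilities dominate the average at large k and warp the aggregate curve into a power law, exactly the coexistence the abstract describes.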
Problem

Research questions and friction points this paper is trying to address.

Explains the power law observed in aggregate language model success rates
Reconciles per-problem exponential scaling with aggregate polynomial scaling
Provides a method for forecasting the power-law exponent
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heavy-tailed modeling of per-task success probabilities
Explanation of the exponential-to-power-law transition
Exponent forecasting with far less inference compute
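The compute savings in the last point can be sketched as follows: if the left tail of the single-attempt success-probability distribution behaves like P(p ≤ x) ~ x^a, then a controls the power-law exponent, so fitting the tail from cheap per-task estimates forecasts the exponent without running large numbers of attempts. The Beta(a, 1) stand-in data and the simple CDF fit below are assumptions for illustration, not the paper's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(1)

a_true = 0.3                             # unknown tail exponent to recover
# Stand-in for estimated per-task single-attempt success probabilities
# (hypothetical data; in practice these would come from a few attempts per task).
p = rng.beta(a_true, 1.0, size=100_000)

# If P(p <= x) ~ x^a near zero, then -log(average success rate) decays
# as k^{-a}, so the left tail of the empirical CDF forecasts the
# power-law exponent with no large-k inference runs.
xs = np.logspace(-6, -2, 10)
cdf = np.array([np.mean(p <= x) for x in xs])
a_hat, _ = np.polyfit(np.log(xs), np.log(cdf), 1)
print(f"forecast exponent: {a_hat:.2f} (true value {a_true})")
```

Because the fit uses only single-attempt statistics rather than sweeping the attempt count over several decades, it illustrates how an exponent forecast can cost orders of magnitude less inference compute than measuring the power law directly.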