🤖 AI Summary
This work investigates the true contribution of algorithmic advances to the 22,000× improvement in AI training efficiency observed between 2012 and 2023.
Method: We combine small-scale ablation experiments, cross-model scaling analysis, literature-based extrapolation, FLOP-efficiency modeling, and compute-optimal scaling laws.
Contribution/Results: We find that algorithmic gains are strongly scale-dependent, not scale-invariant as commonly assumed. Transformers follow a compute-optimal scaling law with a different exponent than LSTMs, so their efficiency advantage grows with compute; this transition is the dominant source of improvement. Our analysis quantitatively accounts for 6,930× of the total efficiency gain. Crucially, algorithmic progress for small models has been far slower than previously assumed: approximately 99% of reported “algorithmic gains” arise from scale synergy rather than intrinsic algorithmic improvements. This challenges the prevailing assumption that algorithmic progress is independent of compute scale and motivates scale-aware algorithmic evaluation, in which efficiency gains are reported relative to an explicit compute scale and reference baseline.
📝 Abstract
Algorithms have been estimated to increase AI training FLOP efficiency by a factor of 22,000 between 2012 and 2023 [Ho et al., 2024]. Running small-scale ablation experiments on key innovations from this time period, we are able to account for less than 10x of these gains. Surveying the broader literature, we estimate that additional innovations not included in our ablations account for less than 10x more, yielding a combined total under 100x. This gap leads us to conduct scaling experiments, which reveal that much of it can be explained by algorithms whose efficiency improvements depend on scale. In particular, we run scaling experiments comparing LSTMs and Transformers and find that their compute-optimal scaling laws differ in exponent, whereas many other innovations show little scaling difference. These experiments demonstrate that, contrary to standard assumptions, an algorithm's efficiency gains are tied to compute scale. Using experimental extrapolation and literature estimates, we account for 6,930x efficiency gains over the same time period, with the scale-dependent LSTM-to-Transformer transition accounting for the majority of gains. Our results indicate that algorithmic progress for small models has been far slower than previously assumed, and that measures of algorithmic efficiency are strongly reference-dependent.
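To make the scale-dependence claim concrete, the sketch below compares two hypothetical power-law loss curves of the form L(C) = a * C^(-alpha). The functional form, the coefficients, and the function names are illustrative assumptions and are not taken from the paper. It shows that when two algorithms differ in scaling exponent, the compute-equivalent efficiency gain grows with the compute at which it is measured, whereas a difference in the prefactor alone gives the same gain at every scale.

```python
# Minimal sketch, assuming loss follows L(C) = a * C**(-alpha).
# All coefficients below are hypothetical, chosen only for illustration.

def loss(compute, a, alpha):
    """Loss reached at a given training compute under the assumed power law."""
    return a * compute ** (-alpha)

def compute_to_reach(target_loss, a, alpha):
    """Invert the power law: compute needed to reach target_loss."""
    return (a / target_loss) ** (1.0 / alpha)

def efficiency_gain(compute_new, old, new):
    """Compute-equivalent gain: how many times more compute the old
    algorithm needs to match the new algorithm's loss at compute_new."""
    target = loss(compute_new, *new)
    return compute_to_reach(target, *old) / compute_new

OLD = (10.0, 0.05)           # hypothetical baseline (a, alpha)
NEW_EXPONENT = (10.0, 0.06)  # same prefactor, better exponent
NEW_CONSTANT = (8.0, 0.05)   # better prefactor, same exponent

for c in (1e15, 1e20):
    print(f"C={c:.0e}  exponent change: {efficiency_gain(c, OLD, NEW_EXPONENT):.3g}x"
          f"  prefactor change: {efficiency_gain(c, OLD, NEW_CONSTANT):.3g}x")
```

Under these assumed curves, the gain from the exponent change depends on the reference compute, which is one way to read the abstract's statement that efficiency measures are strongly reference-dependent.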