🤖 AI Summary
This work investigates how architectural and data-related design decisions (not merely parameter count) affect downstream performance, addressing the counterintuitive phenomenon in which smaller language models sometimes outperform larger ones.
Method: We conduct the first meta-analysis of 92 open-source models, combining multivariate regression with cross-scale performance attribution to build an interpretable, quantitative framework for measuring the effects of design decisions.
Contribution/Results: We identify several non-scale factors, most notably rotary positional encoding and a 15-25% code fraction in the pretraining mix, as statistically significant performance levers; incorporating them improves downstream performance prediction by a relative 3-28% over scale-only baselines. These findings demonstrate that principled choices about data composition and model architecture can systematically unlock the latent capabilities of smaller models, challenging scale-centric evaluation paradigms. The study provides empirically grounded design principles and a reusable methodology for efficient, performance-aware language model development.
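To make the prediction setup concrete, here is a minimal sketch of the scale-only versus design-augmented comparison. It is not the paper's actual pipeline: the data file, column names, and feature set (log parameters, log tokens, code fraction, positional-encoding type) are hypothetical placeholders standing in for the 92-model dataset.

```python
# Hedged sketch: compare a scale-only regression against one augmented with
# design features. The data file and all column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

models = pd.read_csv("models.csv")  # hypothetical: one row per pretrained model

# Scale-only baseline: log parameter count and log training tokens.
X_scale = np.log(models[["n_params", "n_tokens"]])

# Augmented design matrix: add data-composition and architecture features.
X_design = pd.concat(
    [
        X_scale,
        models[["pct_code"]],                    # fraction of code in the mix
        pd.get_dummies(models["pos_encoding"]),  # rotary / learned / ...
    ],
    axis=1,
)

y = models["benchmark_score"]  # downstream task performance

def cv_error(X, y):
    """Cross-validated mean squared error of a linear fit."""
    scores = cross_val_score(
        LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
    )
    return -scores.mean()

err_scale = cv_error(X_scale, y)
err_design = cv_error(X_design, y)
print(f"relative error reduction: {(err_scale - err_design) / err_scale:.1%}")
```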
📝 Abstract
Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveals insights into data composition, such as the trade-off between language and code tasks at 15-25% code, as well as the better performance of some architectural decisions, such as choosing rotary over learned positional embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.
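As a rough illustration of how such a language/code trade-off could be located, the sketch below fits quadratics of task score against code fraction and reports where each curve peaks. The data points are toy values chosen only to mimic the reported trend; they are not the paper's measurements.

```python
# Hedged illustration of the language-vs-code trade-off analysis.
# All data points below are fabricated placeholders, not measured results.
import numpy as np

# Hypothetical observations: fraction of code in the pretraining mix vs. score.
code_frac  = np.array([0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.50])
lang_score = np.array([0.62, 0.64, 0.66, 0.67, 0.67, 0.66, 0.63, 0.58])
code_score = np.array([0.10, 0.22, 0.31, 0.38, 0.43, 0.47, 0.52, 0.55])

for name, score in [("language tasks", lang_score), ("code tasks", code_score)]:
    a, b, _ = np.polyfit(code_frac, score, deg=2)  # score ≈ a*f^2 + b*f + c
    peak = -b / (2 * a)  # vertex of the fitted parabola (both fits have a < 0)
    print(f"{name}: fitted quadratic peaks near {peak:.0%} code")
```

Under these toy numbers the language-task curve peaks inside the 15-25% band while the code-task curve keeps rising through it, which is the shape of trade-off the abstract describes.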