🤖 AI Summary
This work addresses the lack of effective support for large-scale ablation studies in existing open-source frameworks, which forces researchers to rely on extensive custom scripting and hinders efficient, reproducible large language model research. To bridge this gap, we propose an end-to-end, PyTorch-native training framework that, for the first time, unifies efficient pretraining at the scale of hundreds of billions of parameters and trillions of tokens with systematic ablation studies within a single framework. By integrating advanced parallelization strategies, modular design principles, and a declarative configuration system, our framework substantially reduces experimental development overhead while improving reproducibility and engineering efficiency.
📝 Abstract
Today's LLM (pre-)training and research workflows typically allocate a significant amount of compute to large-scale ablation studies. Despite the substantial compute costs of these ablations, existing open-source frameworks provide limited tooling for such experiments, often forcing researchers to write their own wrappers and scripts. We propose Modalities, an end-to-end PyTorch-native framework that integrates data-driven LLM research with large-scale model training from two angles. Firstly, by integrating state-of-the-art parallelization strategies, it enables both efficient pretraining and systematic ablations at trillion-token and billion-parameter scale. Secondly, Modalities adopts a modular design with declarative, self-contained configurations, enabling levels of reproducibility and extensibility that are difficult to achieve out of the box with existing LLM training frameworks.