An Open-Source Training Dataset for Foundation Models for Black-box Optimization

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles in black-box optimization: existing methods depend heavily on hyperparameter tuning and generalize poorly across domains, and the lack of large-scale, publicly available, real-world data has hindered the development of foundation models. To overcome these challenges, the authors introduce BBO-Pile, the first open-source dataset of black-box optimization trajectories, comprising over 500,000 trajectories collected across 3,095 distinct problems. Using this dataset, they train a family of Transformer-based foundation models ranging from 2M to 80M parameters on 200M to 2B tokens and study how they scale with compute. The results indicate that large-scale pre-training is a viable and effective way to imitate black-box optimization methods, supporting reproducible and generalizable foundation-model research in black-box optimization.

📝 Abstract

Most black-box optimization methods require extensive hyperparameter tuning, which often limits their ability to generalize across different optimization domains. Foundation models for black-box optimization, which learn optimization principles from a large collection of optimization trajectories, offer a promising alternative, with the potential to outperform manually designed methods across diverse problem classes. However, prior work has relied either on non-public datasets or on purely synthetic data, limiting reproducibility and generalization to real-world problems. As a result, progress in this area has been constrained by the lack of large-scale, real-world, publicly available pre-training data. We introduce BBO-Pile, the first open-source dataset comprising over 500K optimization trajectories collected from different optimizers across 3,095 different black-box problems, by far the largest public dataset for this task. Using this dataset, we train a family of foundation models at multiple scales, ranging from 2M to 80M parameters and from 200M to 2B training tokens, and study their scaling behavior with respect to compute. Our results demonstrate that large-scale pre-training is a viable and effective approach to imitate black-box optimization methods, paving the way for future research in this direction.
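
To make the idea of learning from optimization trajectories concrete, here is a minimal Python sketch of how one trajectory might be turned into a token sequence for next-token imitation training. The abstract does not describe the tokenization used for BBO-Pile, so everything here is an assumption for illustration: the uniform binning scheme, the function names `quantize`, `tokenize_trajectory`, and `next_token_pairs`, and the representation of a trajectory as a list of `(x, y)` evaluations.

```python
from typing import List, Tuple

# Hypothetical representation: a trajectory is the ordered list of points an
# optimizer evaluated, each paired with the observed objective value.
Trajectory = List[Tuple[List[float], float]]


def quantize(value: float, low: float, high: float, bins: int) -> int:
    """Map a value in [low, high] to an integer bin in [0, bins - 1], clipping outliers."""
    if high <= low:
        raise ValueError("high must be greater than low")
    frac = (value - low) / (high - low)
    frac = min(max(frac, 0.0), 1.0)
    return min(int(frac * bins), bins - 1)


def tokenize_trajectory(
    traj: Trajectory,
    x_bounds: List[Tuple[float, float]],
    y_bounds: Tuple[float, float],
    bins: int = 1000,
) -> List[int]:
    """Flatten a trajectory into tokens: x_1 ... x_d, y for each evaluation in order."""
    tokens: List[int] = []
    for x, y in traj:
        for xi, (low, high) in zip(x, x_bounds):
            tokens.append(quantize(xi, low, high, bins))
        # Offset objective tokens so inputs and outputs use disjoint vocabulary ranges.
        tokens.append(bins + quantize(y, y_bounds[0], y_bounds[1], bins))
    return tokens


def next_token_pairs(tokens: List[int]) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs for next-token prediction over one sequence."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    # Toy two-dimensional trajectory with two evaluations (made-up values).
    toy = [([0.0, 1.0], 0.5), ([0.5, 0.25], 0.0)]
    toks = tokenize_trajectory(toy, [(0.0, 1.0), (0.0, 1.0)], (0.0, 1.0), bins=10)
    print(toks)
    print(len(next_token_pairs(toks)))
```

Under this assumed scheme, a model trained to predict the next token learns to propose the next query point given the history of queries and observed values, which is one way to read "imitating" an optimizer. The token budgets quoted in the abstract (200M to 2B) would then count tokens of this flattened kind, although the actual encoding may differ.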

Problem

Research questions and friction points this paper is trying to address.

black-box optimization

foundation models

pre-training dataset

generalization

reproducibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation models

black-box optimization

open-source dataset