LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing hyperparameter optimization (HPO) benchmarks do not fully capture three properties of real-world large language model (LLM) systems: a compound, high-dimensional hyperparameter space in which AI and non-AI components are intertwined; nonlinear effects of the fidelity factors; and heterogeneous costs of measuring configurations. To address this gap, this work presents LLMSYS-HPOBench, which the authors describe as the first HPO benchmark suite for real-world LLM systems. It currently contains 364,450 hyperparameter configurations measured across 932 fidelity settings, with 12–23 hyperparameters, 3–5 fidelity dimensions, 3–9 inference objective metrics, and 2–10 cost metrics, together with the logs generated while measuring the systems. The open-source suite is intended as an evolving platform for revalidating existing HPO algorithms on LLM systems and for exploring new AutoML research directions.

📝 Abstract

Large Language Model (LLM) systems have become the frontier of AI in many application domains, creating new challenges and opportunities in hyperparameter optimization (HPO) for the AutoML community. However, such systems exhibit an unprecedented compound space of hyperparameter configurations arising from both the AI and non-AI components; rich and nonlinear implications of the fidelity factors; and diverse costs of measuring hyperparameter configurations, none of which has been fully captured in existing benchmarks. This paper presents the first (live) benchmark suite and datasets for HPO of real-world LLM systems, dubbed LLMSYS-HPOBench, covering the inference objective values of hyperparameter configurations profiled by running the LLM systems. Currently, LLMSYS-HPOBench contains 364,450 hyperparameter configurations with a dimensionality of 12-23, 3-5 fidelity-factor dimensions leading to 932 settings, 3-9 inference objective metrics, and 2-10 cost metrics, together with the logs generated from measuring the LLM systems. We advocate not only a revalidation of existing HPO algorithms on frontier LLM systems, but also an evolving platform for the AutoML community to explore new research directions. The benchmark suite is available at: https://github.com/ideas-labo/llmsys-hpobench
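
To make the structure described in the abstract concrete, the following is a minimal sketch of how a tabular HPO benchmark of this kind might be queried: each record pairs a hyperparameter configuration and a fidelity setting with objective and cost values, and an optimizer looks results up instead of running the system. Everything here is an assumption made for illustration: the record layout, the names `batch_size`, `max_tokens`, `num_prompts`, `latency_s`, and `measure_time_s`, and the synthetic values. It does not reflect the actual data format or API of LLMSYS-HPOBench.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One pre-measured benchmark entry (hypothetical layout)."""
    config: tuple      # hyperparameter (name, value) pairs
    fidelity: tuple    # fidelity factor (name, value) pairs
    objectives: dict   # objective metrics, e.g. {"latency_s": ...}
    costs: dict        # cost metrics, e.g. {"measure_time_s": ...}


def build_toy_table(seed=0):
    """Build a small synthetic lookup table; all values are made up."""
    rng = random.Random(seed)
    table = {}
    for batch_size in (1, 4, 16):
        for max_tokens in (128, 512):
            for num_prompts in (10, 100):  # hypothetical fidelity factor
                config = (("batch_size", batch_size), ("max_tokens", max_tokens))
                fidelity = (("num_prompts", num_prompts),)
                latency = max_tokens / (50.0 * batch_size ** 0.5) + rng.uniform(0, 0.1)
                table[(config, fidelity)] = Record(
                    config,
                    fidelity,
                    {"latency_s": latency},
                    {"measure_time_s": latency * num_prompts},
                )
    return table


def random_search(table, fidelity, budget_s, objective="latency_s", seed=0):
    """Random search over table lookups, charging each lookup its recorded cost."""
    rng = random.Random(seed)
    candidates = [rec for (_, fid), rec in table.items() if fid == fidelity]
    rng.shuffle(candidates)
    best, spent = None, 0.0
    for rec in candidates:
        cost = rec.costs["measure_time_s"]
        if spent + cost > budget_s:
            break  # stop once the next measurement would exceed the budget
        spent += cost
        if best is None or rec.objectives[objective] < best.objectives[objective]:
            best = rec
    return best, spent


if __name__ == "__main__":
    toy = build_toy_table()
    best, spent = random_search(toy, (("num_prompts", 10),), budget_s=200.0)
    print(dict(best.config), best.objectives, round(spent, 2))
```

Charging the recorded measurement cost against a budget, rather than counting evaluations, is one simple way to account for the diverse measurement costs the abstract highlights; a real study would substitute the benchmark's own records and a stronger optimizer.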

Problem

Research questions and friction points this paper is trying to address.

Hyperparameter Optimization

Large Language Models

Benchmark Suite

AutoML

Fidelity Factors

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hyperparameter Optimization

Large Language Models

Benchmark Suite

Fidelity Factors

Multi-objective Evaluation

🔎 Similar Papers

Large Language Model Agent for Hyper-Parameter Optimization

2024-02-02 · arXiv.org · Citations: 12