Linux Kernel Configurations at Scale: A Dataset for Performance and Evolution Analysis

📅 2025-05-12

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing research is hindered by the lack of large-scale Linux kernel configuration datasets that span multiple kernel versions and provide fine-grained, quantitative metrics. To address this, we introduce LinuxData, a curated dataset comprising over 240,000 automatically sampled configurations across 27 kernel versions (4.13–5.8), each annotated with its compilation outcome and binary size. Our approach enables, for the first time, cross-version prediction of compilation success rate and binary size with high accuracy (mean absolute percentage error below 3.2%). This supports advanced analyses of configurable systems, including configuration-space modeling, evolutionary analysis, and transfer learning. The dataset is publicly released with a lightweight Python API, OpenML integration, and standardized machine learning benchmarking pipelines. By providing reproducible, version-aware ground truth and streamlined evaluation infrastructure, LinuxData improves both reproducibility and generalizability in configurable-systems research.
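
The accuracy figure quoted in the summary is a mean absolute percentage error (MAPE). As a reminder of what that number measures, here is a minimal sketch of the standard MAPE definition in plain Python. It assumes the authors use the conventional formula. The function name and the sample sizes are illustrative and do not come from the dataset.

```python
def mape(actual, predicted):
    """Return the mean absolute percentage error, in percent.

    Standard definition: the mean of |actual - predicted| / |actual|, times 100.
    Whether the paper uses exactly this variant is an assumption.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("need at least one measurement")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical binary sizes in MiB (made up for illustration).
measured = [20.0, 40.0, 50.0]
estimated = [21.0, 38.0, 50.0]
error = mape(measured, estimated)
print(f"MAPE: {error:.2f}%")
```

Because each error is divided by the measured value, a 1 MiB miss on a 20 MiB kernel counts the same as a 2 MiB miss on a 40 MiB kernel, which makes the metric comparable across configurations of very different sizes.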


📝 Abstract

Configuring the Linux kernel to meet specific requirements, such as binary size, is highly challenging because of its immense complexity: it has over 15,000 interdependent options that evolve rapidly across versions. Although several studies have explored sampling strategies and machine learning methods to understand and predict the impact of configuration options, the literature still lacks a comprehensive, large-scale dataset that covers multiple kernel versions and includes detailed quantitative measurements. To bridge this gap, we introduce LinuxData, an accessible collection of kernel configurations spanning kernel releases from version 4.13 to 5.8. The dataset, gathered through automated tools and build processes, comprises over 240,000 kernel configurations systematically labeled with compilation outcomes and binary sizes. By recording how configurations evolve and capturing the interplay among kernel options, it enables research in feature subset selection, machine-learning-based prediction models, and transfer learning across kernel versions. In this paper, we describe how the dataset has been made accessible via OpenML and illustrate how it can be used, with only a few lines of Python code, to evaluate AI-based techniques such as supervised machine learning. We anticipate that this dataset will improve reproducibility and foster new insights into configuration-space analysis at a scale that presents both unique opportunities and inherent challenges, thereby advancing our understanding of the Linux kernel's configurability and evolution.
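
The abstract mentions evaluating supervised learning on the OpenML-hosted data with a few lines of Python. The sketch below shows roughly what such an evaluation might look like with scikit-learn. It is not the authors' pipeline. The helper names, the choice of a random forest, and the assumption that binary size is the dataset's default target are all illustrative, and the OpenML dataset identifier is deliberately left as a parameter because it is not given here. The demo at the bottom runs on synthetic stand-in data so that it works offline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split


def load_configurations(data_id):
    """Fetch one table of configurations from OpenML (needs network access).

    `data_id` is a placeholder: look up the real identifier on OpenML.
    Assumes the default target column holds the binary size, and that
    option values are already numeric (otherwise encode them first).
    """
    from sklearn.datasets import fetch_openml

    bunch = fetch_openml(data_id=data_id, as_frame=True)
    return bunch.data, bunch.target


def evaluate_size_model(X, y, seed=0):
    """Train a regressor on 80% of the rows and return test MAPE in percent."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return 100.0 * mean_absolute_percentage_error(y_test, predictions)


# Synthetic stand-in: 400 configurations with 20 on/off options, where each
# enabled option adds a fixed amount to a 10-unit base size. This is NOT
# LinuxData; it only demonstrates the evaluation loop.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(400, 20))
option_cost = rng.uniform(0.1, 2.0, size=20)
y_demo = 10.0 + X_demo @ option_cost

demo_mape = evaluate_size_model(X_demo, y_demo)
print(f"Hold-out MAPE on synthetic data: {demo_mape:.2f}%")
```

With real data, the synthetic arrays would be replaced by the result of `load_configurations(...)`. For a cross-version experiment in the spirit of the transfer-learning use case, one would fit on one kernel version's table and score on another's instead of using a random split, after aligning the option columns that the two versions share.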

Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale Linux kernel configuration dataset for analysis

Challenges in predicting configuration impacts across kernel versions

Need for systematic performance and evolution tracking in kernel options

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated collection of over 240,000 kernel configurations

Systematic labeling with compilation outcomes and binary sizes

Enables AI-based techniques via accessible OpenML dataset
