LOOPerSet: A Large-Scale Dataset for Data-Driven Polyhedral Compiler Optimization

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Machine learning–driven compiler optimization is hindered by the scarcity of large-scale, high-quality polyhedral program performance datasets, which makes data generation prohibitively expensive and experiments hard to reproduce. To address this, we introduce the first publicly available, ultra-large-scale polyhedral program performance dataset. Leveraging synthetic program generation, we construct 220,000 structurally diverse polyhedral programs, apply semantics-preserving transformation sequences (including fusion, skewing, tiling, and parallelization), and measure their empirical execution times, yielding a dataset of 28 million labeled samples. This dataset substantially lowers the barrier to developing learned cost models and automated scheduling algorithms, supports diverse model training and benchmarking, improves experimental efficiency and result comparability, and establishes foundational infrastructure for data-driven compiler optimization.
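The transformations named above reorder a program's iterations without changing its result. As an illustrative sketch only (not code from the paper), loop tiling can be shown in plain Python: the tiled nest visits the same iterations as the original, just grouped into blocks, so both versions compute identical outputs while the tiled one changes the memory-access order that a cost model would have to price.

```python
# Hypothetical sketch of loop tiling as a semantics-preserving
# transformation; N and T are illustrative sizes, not dataset parameters.
N, T = 64, 16  # problem size and tile size (T divides N here)

a = [[i * N + j for j in range(N)] for i in range(N)]

def original(inp):
    # Plain row-major double loop.
    out = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            out[i][j] = inp[i][j] * 2
    return out

def tiled(inp):
    # Same iteration set, reordered into T x T blocks.
    out = [[0] * N for _ in range(N)]
    for ii in range(0, N, T):
        for jj in range(0, N, T):
            for i in range(ii, ii + T):
                for j in range(jj, jj + T):
                    out[i][j] = inp[i][j] * 2
    return out

# Tiling preserves semantics: both versions agree everywhere.
assert original(a) == tiled(a)
```

In a compiled setting the two versions can differ substantially in execution time due to cache locality, which is exactly the kind of label the dataset records.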

📝 Abstract
The advancement of machine learning for compiler optimization, particularly within the polyhedral model, is constrained by the scarcity of large-scale, public performance datasets. This data bottleneck forces researchers to undertake costly data generation campaigns, slowing down innovation and hindering reproducible research in learned code optimization. To address this gap, we introduce LOOPerSet, a new public dataset containing 28 million labeled data points derived from 220,000 unique, synthetically generated polyhedral programs. Each data point maps a program and a complex sequence of semantics-preserving transformations (such as fusion, skewing, tiling, and parallelism) to a ground-truth performance measurement (execution time). The scale and diversity of LOOPerSet make it a valuable resource for training and evaluating learned cost models, benchmarking new model architectures, and exploring the frontiers of automated polyhedral scheduling. The dataset is released under a permissive license to foster reproducible research and lower the barrier to entry for data-driven compiler optimization.
Problem

Research questions and friction points this paper is trying to address.

Addresses data scarcity in machine learning for polyhedral compiler optimization
Provides large-scale labeled dataset for training learned cost models
Enables reproducible research in automated polyhedral scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale polyhedral program dataset generation
Mapping transformation sequences to performance measurements
Permissive licensing for reproducible compiler optimization
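The core contribution is a mapping from (program, transformation sequence) pairs to measured execution times. A minimal sketch of what one such labeled sample and a cost-model training pair might look like is below; the field names and schedule notation are assumptions for illustration, not the released schema.

```python
# Hypothetical sample layout: field names and the schedule string
# notation are illustrative assumptions, not LOOPerSet's actual schema.
sample = {
    "program_id": "prog_000042",          # one of the synthetic programs
    "schedule": [                          # semantics-preserving sequence
        "fuse(L0, L1)",
        "skew(L0, factor=1)",
        "tile(L0, size=32)",
        "parallelize(L0)",
    ],
    "execution_time_ms": 12.7,             # measured ground-truth label
}

def to_training_pair(s):
    # A learned cost model consumes the program and schedule as features
    # and regresses the measured runtime as the target.
    features = (s["program_id"], tuple(s["schedule"]))
    target = s["execution_time_ms"]
    return features, target

features, target = to_training_pair(sample)
```

A cost model trained on millions of such pairs can then rank candidate schedules without executing them, which is the use case the dataset is built to support.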