AI Summary
This work addresses the poor performance of existing tensor compilers on long-tail workloads: profiling shows that 43% of real-world subgraphs suffer end-to-end slowdowns under default compilation. Rather than generating standalone kernels, we propose having large language models (LLMs) author structured graph transformations (passes) that integrate directly into compiler pipelines, and we present PassNet as the first large-scale ecosystem for LLM-based compiler pass generation. PassNet comprises PassNet-Dataset, with over 18K unique computational graphs drawn from 100K real-world models, and PassBench, a benchmark of 200 curated long-tail fusible tasks (2,060 subgraphs in total). PassBench is scored with the Error-aware Speedup Score (ES_t), which unifies correctness, stability, and performance, and it includes layered integrity defenses against systematic LLM exploitation. The best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler. Fine-tuning a small model on only ~4K PassNet trajectories yields a 2.67x improvement that approaches frontier-model performance.
Abstract
Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.
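To make the pass abstraction concrete, here is a minimal sketch of a structured graph transformation. It is a toy illustration only. The `Node` type, the `fuse_mul_add` function, and the `mul`/`add`/`fma` op names are hypothetical and do not come from PassNet, PassBench, or TorchInductor. The pass rewrites `add(mul(a, b), c)` into a single fused node, and only when the intermediate product has no other consumer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One operation in a toy computational graph (hypothetical IR)."""
    name: str
    op: str
    inputs: tuple


def fuse_mul_add(graph):
    """Rewrite add(mul(a, b), c) into fma(a, b, c).

    The rewrite is applied only when the mul result has exactly one user,
    so removing the mul node cannot break any other consumer.
    """
    by_name = {n.name: n for n in graph}

    # Count how many times each value is consumed.
    uses = {}
    for n in graph:
        for i in n.inputs:
            uses[i] = uses.get(i, 0) + 1

    fused = set()  # names of mul nodes absorbed into an fma
    out = []
    for n in graph:
        if n.op == "add":
            lhs = by_name.get(n.inputs[0])  # None for graph inputs
            if lhs is not None and lhs.op == "mul" and uses[lhs.name] == 1:
                fused.add(lhs.name)
                out.append(Node(n.name, "fma", lhs.inputs + (n.inputs[1],)))
                continue
        out.append(n)

    # Drop the mul nodes that were folded into fused nodes.
    return [n for n in out if n.name not in fused]


graph = [
    Node("t0", "mul", ("x", "w")),
    Node("t1", "add", ("t0", "b")),
    Node("t2", "relu", ("t1",)),
]
optimized = fuse_mul_add(graph)

# Same pattern, but t0 has a second user, so the pass must leave it alone.
shared = graph + [Node("t3", "neg", ("t0",))]
unchanged = fuse_mul_add(shared)
```

Unlike a standalone kernel, a transformation of this shape declares the pattern it matches and the condition under which the rewrite is safe, which is what lets it slot into an existing compiler pipeline.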