TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high data collection costs and poor cross-hardware generalization of traditional deep learning compilers, which rely on large-scale offline datasets for tensor program optimization. To overcome these limitations, the authors propose TCL, a framework that combines a joint representativeness-diversity-uncertainty active sampling strategy, a lightweight Mamba-based cost model, and a cross-platform continual knowledge distillation mechanism. This design substantially reduces dependency on training data while enabling rapid optimization across heterogeneous hardware such as CPUs and GPUs. Experimental results show that, compared to Tenset-MLP, TCL tunes 16.8× and 12.48× faster on average on CPU and GPU platforms, respectively, while achieving 1.20× and 1.13× lower inference latency.

📝 Abstract

Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, incurring high collection costs and offering suboptimal transferability across platforms. In this paper, we introduce TCL, a novel efficient and transferable compiler framework for fast tensor program optimization across diverse hardware platforms to address these challenges. Specifically, TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that selects only 10% of tensor programs by jointly optimizing Representativeness, Diversity, and Uncertainty, substantially reducing data collection costs while maintaining near-original model accuracy; (2) a new Mamba-based cost model that efficiently captures long-range schedule dependencies while achieving a favorable trade-off between prediction accuracy and computational cost through reduced parameterization and lightweight sequence modeling; and (3) a continuous knowledge distillation framework that effectively and progressively transfers knowledge across multiple hardware platforms while avoiding the parameter explosion and data dependency issues typically caused by traditional multi-task learning. Extensive experiments validate the effectiveness of each individual enabler and the holistic TCL framework. When optimizing a range of mainstream DL models on both CPU and GPU platforms, TCL achieves, on average, 16.8x and 12.48x faster tuning time, and 1.20x and 1.13x lower inference latency, respectively, compared to Tenset-MLP.
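The abstract does not spell out how the RDU Sampler combines its three criteria, so the following is only a minimal sketch of one way a joint representativeness-diversity-uncertainty selection could look, not the authors' method. The function name `rdu_select`, the equal weights, the greedy loop, and the use of ensemble variance as the uncertainty signal are all assumptions made for illustration.

```python
import math
import random
import statistics


def _dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _normalize(values):
    """Min-max scale a list to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rdu_select(features, ensemble_preds, fraction=0.10, weights=(1.0, 1.0, 1.0)):
    """Greedily pick about `fraction` of the candidates (at least one).

    features:       one feature vector per candidate tensor program
    ensemble_preds: per candidate, cost predictions from several models;
                    their variance stands in for uncertainty (an assumption)
    weights:        hypothetical (representativeness, diversity, uncertainty) weights
    """
    n = len(features)
    budget = max(1, round(fraction * n))
    w_rep, w_div, w_unc = weights

    # Representativeness: candidates close to many others score higher.
    mean_dist = [sum(_dist(f, g) for g in features) / n for f in features]
    rep = [1.0 - v for v in _normalize(mean_dist)]

    # Uncertainty: disagreement among the ensemble members.
    unc = _normalize([statistics.pvariance(p) for p in ensemble_preds])

    selected = []
    # Distance from each candidate to its nearest already-selected sample.
    nearest = [math.inf] * n
    while len(selected) < budget:
        # Diversity: prefer candidates far from everything picked so far.
        div = _normalize(nearest) if selected else [0.0] * n
        best = max(
            (i for i in range(n) if i not in selected),
            key=lambda i: w_rep * rep[i] + w_div * div[i] + w_unc * unc[i],
        )
        selected.append(best)
        for i in range(n):
            nearest[i] = min(nearest[i], _dist(features[i], features[best]))
    return selected


# Synthetic stand-ins for program features and ensemble cost predictions.
random.seed(0)
feats = [[random.random() for _ in range(8)] for _ in range(200)]
preds = [[random.gauss(1.0, 0.2) for _ in range(5)] for _ in range(200)]
picked = rdu_select(feats, preds)
print(len(picked))  # 20, i.e. 10% of the 200 candidates
```

Only the selected indices would then be measured on real hardware to label training data for the cost model. The greedy loop is quadratic in the number of candidates, which is acceptable for a sketch but would likely need clustering or batching at dataset scale.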

Problem

Research questions and friction points this paper is trying to address.

tensor program optimization

cross-hardware transferability

cost model

data efficiency

deep learning compilers

Innovation

Methods, ideas, or system contributions that make the work stand out.

continual learning

tensor program optimization

Mamba-based cost model

active learning

cross-hardware transfer

🔎 Similar Papers

Pruner: A Draft-then-Verify Exploration Mechanism to Accelerate Tensor Program Tuning

2024-02-04 · Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 · Citations: 0
