Active Imitation Learning for Thermal- and Kernel-Aware LFM Inference on 3D S-NUCA Many-Cores

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets large foundation model (LFM) inference on 3D-stacked Static Non-Uniform Cache Architecture (3D S-NUCA) many-core CPUs, where thermal hotspots and uneven cache latencies arising from 3D network-on-chip (NoC) contention limit both performance and thermal safety. The paper proposes AILFM, a scheduling framework that introduces active imitation learning into thermal- and kernel-aware scheduling. By learning near-optimal policies from oracle demonstrations, AILFM jointly manages thread migration and dynamic voltage/frequency (V/f) scaling while accounting for both core-level performance heterogeneity and the kernel-specific behavior of LFMs. Unlike conventional thermal management schemes, which rely on oversimplified analytical models and adapt poorly, AILFM is reported to outperform state-of-the-art baselines across diverse LFM workloads while maintaining thermal safety with minimal run-time overhead.

📝 Abstract

Large Foundation Model (LFM) inference is both memory- and compute-intensive and has traditionally relied on GPUs. However, the limited availability and high cost of GPUs have motivated the adoption of high-performance general-purpose CPUs, especially emerging 3D-stacked Static Non-Uniform Cache Architecture (3D S-NUCA) systems. These architectures offer enhanced bandwidth and locality but suffer from severe thermal challenges and uneven cache latencies due to their 3D Networks-on-Chip (NoC). Optimal management of thread migration and V/f scaling is non-trivial because of LFM kernel diversity and system heterogeneity. Existing thermal management approaches often rely on oversimplified analytical models and lack adaptability. We propose AILFM, an Active Imitation Learning (AIL)-based scheduling framework that learns near-optimal thermal-aware scheduling policies from Oracle demonstrations with minimal run-time overhead. AILFM accounts for both core-level performance heterogeneity and kernel-specific behavior in LFMs to maintain thermal safety while maximizing performance. Extensive experiments show that AILFM outperforms state-of-the-art baselines and generalizes well across diverse LFM workloads.
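
To make the idea of learning a scheduling policy from Oracle demonstrations more concrete, the following is a minimal sketch of an active imitation learning loop, not the AILFM implementation. Everything in it is an assumption made for illustration: the one-core toy thermal model, the constants, the kernel names, the table-based policy, and the query rule (ask the oracle only in states the learner has not seen). It covers only V/f selection and leaves thread migration out.

```python
import random
from collections import Counter, defaultdict

# All constants below are hypothetical and chosen only for illustration.
T_LIMIT = 80.0                               # thermal threshold (deg C)
T_AMBIENT = 45.0                             # idle temperature (deg C)
FREQS = [1.0, 2.0, 3.0]                      # available V/f levels (GHz)
HEAT_PER_GHZ = {"attention": 4.0, "gemm": 6.0}  # toy per-kernel heating
KERNEL_NAMES = list(HEAT_PER_GHZ)
BUCKET = 5.0                                 # temperature discretization step

def step(temp, kernel, freq):
    """Toy thermal model: heating scales with frequency, cooling with excess temperature."""
    return temp + HEAT_PER_GHZ[kernel] * freq - 0.3 * (temp - T_AMBIENT)

def featurize(temp, kernel):
    """Discretize the observation into a (temperature bucket, kernel) state."""
    return (int(temp // BUCKET), kernel)

def oracle(state):
    """Expensive expert: highest frequency that is safe at the hottest temperature in the bucket."""
    bucket, kernel = state
    worst_temp = (bucket + 1) * BUCKET
    safe = [f for f in FREQS if step(worst_temp, kernel, f) <= T_LIMIT]
    return max(safe) if safe else min(FREQS)

class TablePolicy:
    """Learner: majority vote over the oracle labels aggregated per state."""
    def __init__(self):
        self.votes = defaultdict(Counter)

    def update(self, state, action):
        self.votes[state][action] += 1

    def is_uncertain(self, state):
        return state not in self.votes

    def act(self, state):
        if state in self.votes:
            return self.votes[state].most_common(1)[0][0]
        return min(FREQS)  # conservative fallback for unseen states

def train(episodes=30, horizon=40, seed=0):
    rng = random.Random(seed)
    policy, queries, peak = TablePolicy(), 0, T_AMBIENT
    for _ in range(episodes):
        temp = T_AMBIENT
        for _ in range(horizon):
            kernel = rng.choice(KERNEL_NAMES)
            state = featurize(temp, kernel)
            # Active step: query the oracle only where the learner is uncertain.
            if policy.is_uncertain(state):
                policy.update(state, oracle(state))
                queries += 1
            # Roll out the learner's own action so later queries cover states it actually reaches.
            temp = step(temp, kernel, policy.act(state))
            peak = max(peak, temp)
    return policy, queries, peak

if __name__ == "__main__":
    policy, queries, peak = train()
    print(f"oracle queries: {queries}, peak temperature: {peak:.1f}")
```

In this toy setting the learner needs far fewer oracle calls than scheduling decisions, which is the property that makes active querying attractive when the oracle is costly. A realistic policy would replace the lookup table with a learned model and a calibrated uncertainty estimate.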

Problem

Research questions and friction points this paper is trying to address.

Thermal Management

3D S-NUCA

Large Foundation Model Inference

Thread Migration

Cache Latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Active Imitation Learning

Thermal-Aware Scheduling

3D S-NUCA