🤖 AI Summary
This work addresses the performance and thermal-safety limits that arise when running large-model inference on 3D-stacked Static Non-Uniform Cache Architecture (3D S-NUCA) multicore CPUs, where thermal hotspots and uneven cache latency from the 3D network-on-chip constrain throughput. The paper proposes AILFM, a framework that applies Active Imitation Learning (AIL) to thermal- and core-aware scheduling. By learning near-optimal policies from Oracle demonstrations, AILFM manages thread migration and dynamic voltage/frequency scaling while accounting for both core-level performance heterogeneity and the kernel-specific behavior of large models. This avoids the oversimplified analytical models and poor adaptability of conventional thermal management approaches. In the reported experiments, AILFM outperforms state-of-the-art baselines across diverse large-model workloads while maintaining thermal safety and incurring minimal run-time overhead.
📝 Abstract
Large Foundation Model (LFM) inference is both memory- and compute-intensive and has traditionally relied on GPUs. However, the limited availability and high cost of GPUs have motivated the adoption of high-performance general-purpose CPUs, especially emerging 3D-stacked Static Non-Uniform Cache Architecture (3D S-NUCA) systems. These architectures offer enhanced bandwidth and locality but suffer from severe thermal challenges and uneven cache latencies due to their 3D Networks-on-Chip (NoCs). Optimal management of thread migration and voltage/frequency (V/f) scaling is non-trivial because of LFM kernel diversity and system heterogeneity. Existing thermal management approaches often rely on oversimplified analytical models and lack adaptability. We propose AILFM, an Active Imitation Learning (AIL)-based scheduling framework that learns near-optimal thermal-aware scheduling policies from Oracle demonstrations with minimal run-time overhead. AILFM accounts for both core-level performance heterogeneity and kernel-specific behavior in LFMs to maintain thermal safety while maximizing performance. Extensive experiments show that AILFM outperforms state-of-the-art baselines and generalizes well across diverse LFM workloads.