TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

📅 2025-01-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
In large-scale GPU clusters serving LLM inference in cloud data centers, fine-grained execution-phase variation and coupling across configuration dimensions lead to thermal runaway and power-cap violations. To address this, the paper proposes a temperature- and power-aware dynamic resource scheduling framework that integrates VM placement, request routing, and online SaaS VM reconfiguration, enabling real-time response to cooling or power supply failures. The framework leverages historical temperature and power telemetry for predictive scheduling, employs multi-objective constraint-satisfaction optimization, and supports elastic workload adaptation. Evaluated on a production-scale GPU cluster, it significantly reduces thermal and power throttling events, improves resource utilization by 18.7%, and lowers total cost of ownership (TCO) by 12.3%.
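The placement idea described above can be illustrated with a toy sketch: pick the node that satisfies the power cap and thermal limit, preferring the one with the most thermal headroom. All names, thresholds, and the scoring rule here are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of thermal- and power-aware VM placement.
# Node fields, the 85 C limit, and the min-temperature tie-break are
# illustrative assumptions, not taken from the TAPAS paper.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    temp_c: float       # predicted peak temperature from telemetry history
    power_w: float      # current power draw
    power_cap_w: float  # provisioned power cap

def place_vm(nodes, vm_power_w, temp_limit_c=85.0):
    """Return the feasible node with the most thermal headroom, or None."""
    feasible = [n for n in nodes
                if n.power_w + vm_power_w <= n.power_cap_w
                and n.temp_c < temp_limit_c]
    if not feasible:
        return None  # caller would reconfigure VMs or shed load instead
    return min(feasible, key=lambda n: n.temp_c)

nodes = [
    Node("gpu-a", temp_c=80.0, power_w=950.0, power_cap_w=1000.0),
    Node("gpu-b", temp_c=70.0, power_w=600.0, power_cap_w=1000.0),
]
best = place_vm(nodes, vm_power_w=300.0)  # gpu-a would exceed its cap
```

A real scheduler would replace the static `temp_c` field with a prediction from historical telemetry and fold in the multi-objective constraints the summary mentions.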

📝 Abstract
The rising demand for generative large language models (LLMs) poses challenges for thermal and power management in cloud datacenters. Traditional techniques are often inadequate for LLM inference due to the fine-grained, millisecond-scale execution phases, each with distinct performance, thermal, and power profiles. Additionally, LLM inference workloads are sensitive to various configuration parameters (e.g., model parallelism, size, and quantization) that involve trade-offs between performance, temperature, power, and output quality. Moreover, clouds often co-locate SaaS and IaaS workloads, each with different levels of visibility and flexibility. We propose TAPAS, a thermal- and power-aware framework designed for LLM inference clusters in the cloud. TAPAS enhances cooling and power oversubscription capabilities, reducing the total cost of ownership (TCO) while effectively handling emergencies (e.g., cooling and power failures). The system leverages historical temperature and power data, along with the adaptability of SaaS workloads, to: (1) efficiently place new GPU workload VMs within cooling and power constraints, (2) route LLM inference requests across SaaS VMs, and (3) reconfigure SaaS VMs to manage load spikes and emergency situations. Our evaluation on a large GPU cluster demonstrates significant reductions in thermal and power throttling events, boosting system efficiency.
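The second mechanism in the abstract, routing requests across SaaS VMs, can be sketched as a weighting policy that sends proportionally more traffic to VMs with more thermal and power headroom. The headroom product and all numbers below are illustrative assumptions; the paper's actual routing policy is not reproduced here.

```python
# Toy request-routing weights across SaaS VMs, favoring VMs with more
# combined thermal and power headroom. Purely illustrative: the headroom
# formula and limits are assumptions, not the TAPAS policy.
def route_weights(vms, temp_limit_c=85.0):
    """Return per-VM routing weights that sum to 1.0."""
    headroom = {
        vm: max(0.0, temp_limit_c - temp_c) * max(0.0, cap_w - power_w)
        for vm, (temp_c, power_w, cap_w) in vms.items()
    }
    total = sum(headroom.values()) or 1.0  # avoid division by zero
    return {vm: h / total for vm, h in headroom.items()}

vms = {
    "vm-1": (80.0, 900.0, 1000.0),  # (temp_c, power_w, cap_w): nearly saturated
    "vm-2": (65.0, 500.0, 1000.0),  # cool and lightly loaded
}
weights = route_weights(vms)  # vm-2 receives most of the traffic
```

In practice these weights would be recomputed continuously from telemetry, so a VM approaching its thermal limit is drained before it throttles.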
Problem

Research questions and friction points this paper is trying to address:

- Cloud Data Centers
- Large Language Models
- Resource Management

Innovation

Methods, ideas, or system contributions that make the work stand out:

- Energy-Efficient Scheduling
- LLM Task Allocation
- Cooling System Optimization