🤖 AI Summary
To jointly optimize training efficiency and operational cost for machine learning workloads in cloud environments while satisfying SLA constraints, this paper proposes SLA-MORL, a user-preference-driven adaptive multi-objective reinforcement learning framework. The method mitigates cold-start issues through intelligent initialization via historical policy transfer, introduces an SLA-violation-aware dynamic weighting mechanism that adjusts optimization priorities online, and formulates a 21-dimensional state space (capturing resource utilization, training progress, and SLA compliance) together with a discrete action space of 9 resource-allocation actions. Leveraging an actor-critic architecture, the framework jointly optimizes training time and operational cost. Extensive experiments across 13 real-world ML workloads show that, compared to static baselines, SLA-MORL achieves a 67.2% reduction in training time for deadline-critical jobs, a 68.8% reduction in operational cost for budget-constrained workloads, and a 73.4% improvement in overall SLA compliance.
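To make the actor-critic formulation concrete: the agent maps the 21-dimensional state to a probability distribution over 9 discrete allocation actions (actor) and a scalar state value (critic). The sketch below uses a single linear layer per head for illustration only; the paper's actual network architecture, feature definitions, and weights are not given here, so everything beyond the 21/9 dimensions is an assumption.

```python
import math
import random

STATE_DIM = 21   # resource utilization, training progress, SLA compliance features
N_ACTIONS = 9    # discrete GPU/CPU allocation actions

class ActorCritic:
    """Minimal linear actor-critic sketch (architecture assumed, not from the paper)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Actor head: one weight row per action; critic head: one weight vector.
        self.actor_w = [[rng.gauss(0, 0.1) for _ in range(STATE_DIM)]
                        for _ in range(N_ACTIONS)]
        self.critic_w = [rng.gauss(0, 0.1) for _ in range(STATE_DIM)]

    def policy(self, state):
        """Softmax over action logits -> probabilities for the 9 allocation actions."""
        logits = [sum(w * s for w, s in zip(row, state)) for row in self.actor_w]
        m = max(logits)                          # subtract max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]

    def value(self, state):
        """Critic's scalar estimate of expected return from this state."""
        return sum(w * s for w, s in zip(self.critic_w, state))
```

At each decision step the framework would sample (or greedily pick) an allocation action from `policy(state)`, while `value(state)` supplies the baseline for the actor's policy-gradient update.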
📝 Abstract
Dynamic resource allocation for machine learning workloads in cloud environments remains challenging due to competing objectives of minimizing training time and operational costs while meeting Service Level Agreement (SLA) constraints. Traditional approaches employ static resource allocation or single-objective optimization, leading to either SLA violations or resource waste. We present SLA-MORL, an adaptive multi-objective reinforcement learning framework that intelligently allocates GPU and CPU resources based on user-defined preferences (time, cost, or balanced) while ensuring SLA compliance. Our approach introduces two key innovations: (1) intelligent initialization through historical learning or efficient baseline runs that eliminates cold-start problems, reducing initial exploration overhead by 60%, and (2) dynamic weight adaptation that automatically adjusts optimization priorities based on real-time SLA violation severity, creating a self-correcting system. SLA-MORL constructs a 21-dimensional state representation capturing resource utilization, training progress, and SLA compliance, enabling an actor-critic network to make informed allocation decisions across 9 possible actions. Extensive evaluation on 13 diverse ML workloads using production HPC infrastructure demonstrates that SLA-MORL achieves 67.2% reduction in training time for deadline-critical jobs, 68.8% reduction in costs for budget-constrained workloads, and 73.4% improvement in overall SLA compliance compared to static baselines. By addressing both cold-start inefficiency and dynamic adaptation challenges, SLA-MORL provides a practical solution for cloud resource management that balances performance, cost, and reliability in modern ML training environments.
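The dynamic weight adaptation described above can be illustrated with a minimal scalarization sketch: user preference weights over the time and cost objectives are shifted toward the violated objective in proportion to SLA violation severity, and the shifted weights scalarize the per-objective rewards. The function names, the linear update rule, and the weighted-sum reward are assumptions for illustration, not the paper's actual formulas.

```python
def adapt_weights(prefs, severity):
    """Shift preference weights toward the time objective as SLA violations worsen.

    prefs    : dict with 'time' and 'cost' weights summing to 1 (user preference).
    severity : SLA violation severity in [0, 1] (0 = fully compliant).
    """
    # Linear interpolation toward full time priority (assumed update rule).
    w_time = prefs["time"] + (1.0 - prefs["time"]) * severity
    return {"time": w_time, "cost": 1.0 - w_time}

def scalarized_reward(time_reward, cost_reward, weights):
    """Weighted-sum scalarization of per-objective rewards (both normalized)."""
    return weights["time"] * time_reward + weights["cost"] * cost_reward
```

Under this scheme a balanced preference (0.5/0.5) stays balanced while the job is SLA-compliant, but drifts toward pure time optimization as deadline violations grow, which is the self-correcting behavior the abstract describes.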