🤖 AI Summary
Multi-resource co-scheduling (cache, memory bandwidth, CPU cores) in cloud environments faces challenges including bottleneck avoidance, priority awareness, and generalization across heterogeneous servers. This paper proposes OSML+, the first OS-level intelligent scheduler to jointly optimize multiple resources. OSML+ integrates multi-model collaborative learning, online reinforcement learning, and transfer learning to support dynamic workload adaptation, QoS-driven multi-objective optimization, and resource-cliff avoidance. Its key innovations are: (i) the first integration of transfer learning into an OS scheduling framework, enabling scalable deployment across large-scale heterogeneous cloud infrastructures; and (ii) a memory-hierarchy-aware unified resource modeling methodology. Experiments demonstrate that OSML+ improves load capacity by 32%, achieves >99.7% QoS compliance, reduces scheduling overhead by 41%, and generalizes well across multiple generations of heterogeneous servers.
📝 Abstract
Making system/OS design intelligent is a promising direction. This paper proposes OSML+, a new ML-based resource scheduling mechanism for co-located cloud services. OSML+ intelligently schedules cache and main memory bandwidth resources in the memory hierarchy together with computing core resources. OSML+ uses a multi-model collaborative learning approach during scheduling and can thus handle complicated cases, e.g., avoiding resource cliffs, sharing resources among applications, and applying different scheduling policies to applications with different priorities. Using ML models, OSML+ converges faster than previous approaches. Moreover, OSML+ automatically learns on the fly and adapts to dynamically changing workloads. Using transfer learning techniques, we show our design works well across various cloud servers, including the latest off-the-shelf large-scale servers. Our experimental results show that OSML+ supports higher loads and meets QoS targets with lower overheads than previous studies.
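To make the abstract's "cache and memory bandwidth" knobs concrete, here is a minimal sketch, not taken from the paper, of how such per-application allocations are typically expressed on Linux via Intel RDT's resctrl interface: an L3 allocation is a contiguous bitmask of cache ways, and memory bandwidth is a percentage throttle. The function names and layout below are illustrative assumptions; OSML+'s actual ML-driven allocator is more sophisticated, but it ultimately turns knobs of this shape.

```python
# Illustrative only: format resctrl-style schemata lines that assign
# L3 cache ways and a memory-bandwidth (MBA) throttle to a workload.
# l3_mask / schemata are hypothetical helper names, not OSML+ APIs.

def l3_mask(first_way: int, num_ways: int) -> str:
    """Contiguous bitmask of cache ways, as resctrl's L3 schemata expects."""
    mask = ((1 << num_ways) - 1) << first_way
    return f"{mask:x}"

def schemata(first_way: int, num_ways: int, mb_percent: int,
             domain: int = 0) -> list[str]:
    """Schemata lines for one resource group: L3 ways plus MBA throttle."""
    return [
        f"L3:{domain}={l3_mask(first_way, num_ways)}",
        f"MB:{domain}={mb_percent}",
    ]

# A latency-critical service gets ways 0-7 and full bandwidth; a co-located
# batch job gets ways 8-11 and is throttled to 30% of memory bandwidth.
print(schemata(0, 8, 100))   # ['L3:0=ff', 'MB:0=100']
print(schemata(8, 4, 30))    # ['L3:0=f00', 'MB:0=30']
```

Keeping the two groups' way masks disjoint is what prevents cache interference between the co-located applications; shrinking a mask too far is exactly the "resource cliff" the abstract says OSML+ learns to avoid.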