🤖 AI Summary
Multi-resource co-scheduling (cache, memory bandwidth, CPU cores) in cloud environments faces challenges including bottleneck avoidance, priority awareness, and generalization across heterogeneous servers. This paper proposes OSML+, the first OS-level intelligent scheduler to jointly optimize multiple resources. OSML+ integrates multi-model collaborative learning, online reinforcement learning, and transfer learning to support dynamic workload adaptation, QoS-driven multi-objective optimization, and resource-cliff avoidance. Its key innovations are: (i) the first integration of transfer learning into an OS scheduling framework, enabling scalable deployment across large-scale heterogeneous cloud infrastructures; and (ii) a memory-hierarchy-aware unified resource modeling methodology. Experiments demonstrate that OSML+ improves load capacity by 32%, achieves >99.7% QoS compliance, reduces scheduling overhead by 41%, and generalizes well across multiple generations of heterogeneous servers.
📝 Abstract
Making system/OS design intelligent is a promising direction. This paper proposes OSML+, a new ML-based resource scheduling mechanism for co-located cloud services. OSML+ intelligently schedules cache and main memory bandwidth resources in the memory hierarchy together with computing core resources. OSML+ uses a multi-model collaborative learning approach during scheduling and can thus handle complicated cases, e.g., avoiding resource cliffs, sharing resources among applications, and applying different scheduling policies to applications with different priorities. Using ML models, OSML+ converges faster than previous approaches. Moreover, OSML+ automatically learns on the fly and adapts to dynamically changing workloads. Using transfer learning techniques, we show our design works well across various cloud servers, including the latest off-the-shelf large-scale servers. Our experimental results show that OSML+ supports higher loads and meets QoS targets with lower overheads than previous studies.
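To make the abstract's "cache and memory bandwidth" knobs concrete, here is a minimal sketch, not taken from the paper, of how such per-application allocations are typically expressed on Linux via Intel RDT's resctrl interface: an L3 allocation is a contiguous bitmask of cache ways, and memory bandwidth is a percentage throttle. The function names and layout below are illustrative assumptions; OSML+'s actual ML-driven allocator is more sophisticated, but it ultimately turns knobs of this shape.

```python
# Illustrative only: format resctrl-style schemata lines that assign
# L3 cache ways and a memory-bandwidth (MBA) throttle to a workload.
# l3_mask / schemata are hypothetical helper names, not OSML+ APIs.

def l3_mask(first_way: int, num_ways: int) -> str:
    """Contiguous bitmask of cache ways, as resctrl's L3 schemata expects."""
    mask = ((1 << num_ways) - 1) << first_way
    return f"{mask:x}"

def schemata(first_way: int, num_ways: int, mb_percent: int,
             domain: int = 0) -> list[str]:
    """Schemata lines for one resource group: L3 ways plus MBA throttle."""
    return [
        f"L3:{domain}={l3_mask(first_way, num_ways)}",
        f"MB:{domain}={mb_percent}",
    ]

# A latency-critical service gets ways 0-7 and full bandwidth; a co-located
# batch job gets ways 8-11 and is throttled to 30% of memory bandwidth.
print(schemata(0, 8, 100))   # ['L3:0=ff', 'MB:0=100']
print(schemata(8, 4, 30))    # ['L3:0=f00', 'MB:0=30']
```

Keeping the two groups' way masks disjoint is what prevents cache interference between the co-located applications; shrinking a mask too far is exactly the "resource cliff" the abstract says OSML+ learns to avoid.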