SDT: Cutting Datacenter Tax Through Simultaneous Data-Delivery Threads

📅 2025-03-07

🤖 AI Summary

Problem: In datacenters, network processing competes with application threads for CPU resources, which worsens tail latency and degrades application performance. Scaling network threads with simultaneous multithreading (SMT) does not meet strict tail-latency requirements and takes hardware threads away from applications. Method: This paper proposes a chip-multiprocessor (CMP) in which every physical core carries specialized Simultaneous Data-delivery Threads (SDT), hardware threads dedicated to network processing. Partitioning at the architectural level lets SDT co-run with application threads with guaranteed performance isolation. Contribution/Results: The authors design, implement, and evaluate the SDT-enhanced CMP using full-system simulation. Compared with a 40-core baseline CMP, a 20-core CMP with SDT reduces area by 47.5% and power consumption by 66%, at the cost of less than 10% lower network throughput, which lowers the datacenter network tax.

📝 Abstract

Networking is considered a datacenter tax, and hyperscalers push hard to provide high-performance networking with minimal resource expenditure. To keep up with the ever-increasing network rates, many CPU cycles are spent on the networking tax. We make a key observation that network processing threads can be simultaneously executed on server CPUs with minimal interference with the application threads. However, utilizing simultaneous multithreading (SMT) to scale the number of network threads with the number of application threads has two drawbacks: (1) it fails to meet strict tail-latency requirements for latency-critical applications, and (2) it reduces the number of hardware threads available to application processes, thus contributing to a high datacenter network tax. In this work, we design, implement, and evaluate a chip-multiprocessor (CMP) with specialized Simultaneous Data-delivery Threads (SDT) per physical core. The key insight is that with judicious partitioning at the architectural level, SDT can safely co-run with application processes with guaranteed performance isolation. Our evaluation results, using full-system simulation, show that a 20-core CMP enhanced with SDT reduces the area and power consumption of a baseline 40-core CMP by 47.5% and 66%, respectively, while reducing network throughput by less than 10%.
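The headline numbers are easier to compare when normalized to the 40-core baseline. The sketch below is not from the paper: it simply takes the three reported percentages, treats the baseline as 1.0 on every axis, and derives throughput per unit area and per unit power. The function name `relative_efficiency` and the dictionary keys are hypothetical, and the throughput loss is taken at its reported upper bound of 10%, so the derived ratios are lower bounds.

```python
def relative_efficiency(area_cut, power_cut, throughput_cut):
    """Normalize a design to a baseline of 1.0 on area, power, and throughput.

    Each argument is a fractional reduction relative to the baseline,
    e.g. 0.475 for a 47.5% reduction.
    """
    area = 1.0 - area_cut
    power = 1.0 - power_cut
    throughput = 1.0 - throughput_cut
    return {
        "area": area,
        "power": power,
        "throughput": throughput,
        # Ratios above 1.0 mean more network throughput per unit of resource.
        "throughput_per_area": throughput / area,
        "throughput_per_watt": throughput / power,
    }


# Figures reported in the abstract; 0.10 is the worst-case throughput loss.
sdt = relative_efficiency(area_cut=0.475, power_cut=0.66, throughput_cut=0.10)

for name, value in sdt.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the 20-core SDT design delivers at least about 1.7x the network throughput per unit area and about 2.6x per unit power of the baseline.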

Problem

Research questions and friction points this paper is trying to address.

Reducing datacenter networking tax with minimal resource expenditure.

Addressing strict tail latency requirements for latency-critical applications.

Optimizing hardware thread usage to minimize network processing overhead.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Simultaneous Data-delivery Threads (SDT) per core

Architectural partitioning for performance isolation

47.5% less area and 66% less power than a 40-core baseline CMP, with under 10% throughput loss
