Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects

📅 2024-09-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work challenges the optimality of traditional DMA-based I/O for low-latency, high-concurrency workloads—such as microservices and serverless computing—when deployed over cache-coherent interconnects like CXL 3.0. The authors propose and evaluate programmed I/O: a CPU-centric architecture in which data movement and control are carried out via explicit load/store instructions, eliminating dedicated DMA engines and complex address translation. Key contributions include: (1) the first real-hardware demonstration of an open cache-coherence protocol that exposes cache-state transitions to a smart device; (2) a lightweight device state machine and memory-mapping optimization; and (3) native support for fine-grained RPC, streaming operator offloading, and serverless network interfaces. Experiments demonstrate substantially reduced communication latency, throughput competitive with DMA, and superior end-to-end performance across all three target scenarios compared to both conventional DMA and PCIe-based memory-mapped PIO.

📝 Abstract
Conventional wisdom holds that an efficient interface between an OS running on a CPU and a high-bandwidth I/O device should use Direct Memory Access (DMA) to offload data transfer, descriptor rings for buffering and queuing, and interrupts for asynchrony between cores and device. In this paper we question this wisdom in the light of two trends: modern and emerging cache-coherent interconnects like CXL 3.0, and workloads, particularly microservices and serverless computing. Like some others before us, we argue that the assumptions of the DMA-based model are obsolete, and in many use-cases programmed I/O, where the CPU explicitly transfers data and control information to and from a device via loads and stores, delivers a more efficient system. However, we push this idea much further. We show, in a real hardware implementation, the gains in latency for fine-grained communication achievable using an open cache-coherence protocol which exposes cache transitions to a smart device, and that throughput is competitive with DMA over modern interconnects. We also demonstrate three use-cases: fine-grained RPC-style invocation of functions on an accelerator, offloading of operators in a streaming dataflow engine, and a network interface targeting serverless functions, comparing our use of coherence with both traditional DMA-style interaction and a highly-optimized implementation using memory-mapped programmed I/O over PCIe.
Problem

Research questions and friction points this paper is trying to address.

Questions whether DMA-based I/O remains efficient over modern cache-coherent interconnects
Explores the benefits of programmed I/O for fine-grained communication workloads
Demonstrates coherence-protocol advantages over DMA on real hardware
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses cache-coherent interconnects like CXL 3.0
Employs programmed I/O for efficient data transfer
Exposes cache-state transitions to smart devices