Towards CXL Resilience to CPU Failures

📅 2026-02-09

🤖 AI Summary

This work addresses the lack of fault tolerance for compute-node failures in CXL 3.0 and later, where a failed node loses the dirty data in its caches and corrupts application state. The paper presents ReCXL, an extension to the CXL specification that augments the coherence transaction of each write with messages that propagate the update to a small set of other nodes (Replicas), each of which saves it in a hardware Logging Unit. At regular intervals, the Logging Units dump the logged updates to memory; after a node failure, recovery uses the logs to bring the directory and memory to a correct state. In the authors' evaluation, ReCXL enables fault-tolerant execution with a 30% slowdown over the same platform without fault-tolerance support.

📝 Abstract

Compute Express Link (CXL) 3.0 and beyond allows the compute nodes of a cluster to share data with hardware cache coherence and at the granularity of a cache line. This enables shared-memory semantics for distributed computing, but introduces new resilience challenges: a node failure leads to the loss of the dirty data in its caches, corrupting application state. Unfortunately, the CXL specification does not consider processor failures. Moreover, when a component fails, the specification tries to isolate it and continue application execution; there is no attempt to bring the application to a consistent state. To address these limitations, this paper extends the CXL specification to be resilient to node failures, and to correctly recover the application after node failures. We call the system ReCXL. To handle the failure of nodes, ReCXL augments the coherence transaction of a write with messages that propagate the update to a small set of other nodes (i.e., Replicas). Replicas save the update in a hardware Logging Unit. Such replication ensures resilience to node failures. Then, at regular intervals, the Logging Units dump the updates to memory. Recovery involves using the logs in the Logging Units to bring the directory and memory to a correct state. Our evaluation shows that ReCXL enables fault-tolerant execution with only a 30% slowdown over the same platform with no fault-tolerance support.
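The write, log, dump, and recover flow described in the abstract can be illustrated with a small software model. This is only a toy sketch under stated assumptions, not the paper's hardware design: the names `Cluster`, `Node`, and `LoggingUnit`, the ring-style choice of Replicas, and the per-write sequence number used to order updates are all hypothetical and do not come from the abstract or the CXL specification.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class LoggingUnit:
    """Hypothetical per-node log of updates received while acting as a Replica."""
    entries: list = field(default_factory=list)  # (seq, addr, value) tuples

    def append(self, seq, addr, value):
        self.entries.append((seq, addr, value))

    def dump(self, memory, applied_seq):
        """Apply logged updates to memory, skipping any older than what memory holds."""
        for seq, addr, value in self.entries:
            if seq > applied_seq.get(addr, -1):
                memory[addr] = value
                applied_seq[addr] = seq
        self.entries.clear()


@dataclass
class Node:
    node_id: int
    cache: dict = field(default_factory=dict)  # dirty lines: addr -> value
    log: LoggingUnit = field(default_factory=LoggingUnit)
    alive: bool = True


class Cluster:
    def __init__(self, num_nodes, num_replicas=2):
        self.nodes = [Node(i) for i in range(num_nodes)]
        self.memory = {}        # shared memory: addr -> value
        self.applied_seq = {}   # assumption: last sequence number applied per addr
        self.num_replicas = num_replicas
        self._seq = count()     # assumption: a global order on writes

    def replicas_of(self, writer_id):
        # Assumption: Replicas are the next num_replicas nodes in ring order.
        n = len(self.nodes)
        return [self.nodes[(writer_id + k) % n]
                for k in range(1, self.num_replicas + 1)]

    def write(self, writer_id, addr, value):
        """A write leaves the line dirty in the writer's cache and logs it at Replicas."""
        seq = next(self._seq)
        self.nodes[writer_id].cache[addr] = value
        for replica in self.replicas_of(writer_id):
            if replica.alive:
                replica.log.append(seq, addr, value)

    def dump_logs(self):
        """Periodic step: every live Logging Unit dumps its updates to memory."""
        for node in self.nodes:
            if node.alive:
                node.log.dump(self.memory, self.applied_seq)

    def fail(self, node_id):
        """A node failure loses both its dirty cache lines and its own log."""
        node = self.nodes[node_id]
        node.alive = False
        node.cache.clear()
        node.log.entries.clear()

    def recover(self):
        """Recovery: surviving logs bring memory up to date with the lost writes."""
        self.dump_logs()
```

For example, if node 0 writes a line twice and then fails before any dump, calling `recover()` in this model restores the latest value from the logs held by nodes 1 and 2. The model leaves out directory state, message timing, and the case where a writer and all of its Replicas fail together.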

Problem

Research questions and friction points this paper is trying to address.

CXL

CPU failure

resilience

cache coherence

fault tolerance

Innovation

Methods, ideas, or system contributions that make the work stand out.

CXL resilience

fault tolerance

cache coherence