AI Summary
This work addresses the memory wall between processors and storage, where data movement has become a critical performance bottleneck. Existing computational storage solutions struggle to scale due to programming complexity, ecosystem fragmentation, and thermal/power constraints. To overcome these limitations, the authors propose a reversible computational storage architecture that enables dynamic migration of WebAssembly-compiled storage actors between the host and CXL SSDs. The design leverages CXL.mem's cache coherence for seamless state sharing and introduces a zero-copy drain-and-switch protocol to manage thermal and power constraints. An agility-aware scheduler elastically dispatches compute tasks based on runtime conditions. Evaluations on both FPGA prototypes and commercial computational storage devices demonstrate up to 2× higher throughput and 3.75× lower write latency without requiring application modifications, effectively transforming rigid thermal limits into tunable performance trade-offs.
Abstract
The widening gap between processor speed and storage latency has made data movement a dominant bottleneck in modern systems. Two lines of storage-layer innovation attempted to close this gap: persistent memory shortened the latency hierarchy, while computational storage devices pushed processing toward the data. Neither has displaced conventional NVMe SSDs at scale, largely due to programming complexity, ecosystem fragmentation, and thermal/power cliffs under sustained load. We argue that storage-side compute should be \emph{reversible}: computation should migrate dynamically between host and device based on runtime conditions. We present \sys, which realizes this principle on CXL SSDs by decomposing I/O-path logic into migratable \emph{storage actors} compiled to WebAssembly. Actors share state through coherent CXL.mem regions; an agility-aware scheduler migrates them via a zero-copy drain-and-switch protocol when thermal or power constraints arise. Our evaluation on an FPGA-based CXL SSD prototype and two production CSDs shows that \sys turns hard thermal cliffs into elastic trade-offs, achieving up to 2$\times$ throughput improvement and 3.75$\times$ write latency reduction without application modification.
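The migration mechanism described above can be illustrated with a minimal sketch. All names here (`StorageActor`, `Scheduler`, the thermal threshold) are hypothetical stand-ins, not the paper's actual API; a shared `bytearray` plays the role of a coherent CXL.mem region, so switching the execution site moves no actor state (the zero-copy property), while the drain step quiesces in-flight I/O before the switch.

```python
# Hypothetical sketch of drain-and-switch actor migration; illustrative only.
from dataclasses import dataclass, field
from collections import deque

@dataclass
class StorageActor:
    name: str
    site: str = "device"  # current execution site: "device" or "host"
    # Stand-in for a coherent CXL.mem region: visible from either site,
    # so migration never copies it.
    state: bytearray = field(default_factory=lambda: bytearray(64))
    inflight: deque = field(default_factory=deque)

    def submit(self, req: str) -> None:
        self.inflight.append(req)

    def drain(self) -> None:
        # Quiesce: complete every in-flight request before switching sites.
        while self.inflight:
            self.inflight.popleft()  # stand-in for real I/O completion

class Scheduler:
    THERMAL_LIMIT_C = 85.0  # illustrative device thermal threshold

    def maybe_migrate(self, actor: StorageActor, device_temp_c: float) -> str:
        if actor.site == "device" and device_temp_c > self.THERMAL_LIMIT_C:
            actor.drain()        # 1. drain: no requests straddle the switch
            actor.site = "host"  # 2. switch: new requests dispatch to host
            # 3. zero-copy: actor.state is the same coherent region either way
        return actor.site

actor = StorageActor("compression")
actor.submit("write-4k")
sched = Scheduler()
print(sched.maybe_migrate(actor, device_temp_c=92.0))  # prints "host"
```

The key design point the sketch mirrors is that the thermal limit becomes a scheduling input rather than a hard stop: exceeding it triggers a site change instead of throttling, which is how a rigid thermal cliff turns into an elastic trade-off.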