🤖 AI Summary
Modern data storage systems stack tightly coupled hardware and software layers, and latent bugs in those layers can elude extensive testing and cause data corruption, system downtime, or unrecoverable data loss. This chapter takes a holistic view of the storage stack. It introduces the typical architecture and major components of modern storage systems, including solid state drives, persistent memories, local file systems, and distributed storage management at scale. It then discusses representative bug detection and fault tolerance techniques across layers, focusing on issues that affect system recovery and data integrity, and concludes with open challenges and future work.
📝 Abstract
Data storage systems serve as the foundation of digital society. The enormous volume of data that people generate every day makes the fault tolerance of these systems increasingly important. Unfortunately, modern storage systems consist of complicated, interacting hardware and software layers, which may contain latent bugs that elude extensive testing and lead to data corruption, system downtime, or even unrecoverable data loss in practice. In this chapter, we take a holistic view and introduce the typical architecture and major components of modern data storage systems (e.g., solid state drives, persistent memories, local file systems, and distributed storage management at scale). Next, we discuss a few representative bug detection and fault tolerance techniques across layers, with a focus on issues that affect system recovery and data integrity. Finally, we conclude with open challenges and future work.