🤖 AI Summary
Learned systems, such as databases with learned indexes and caches, expose a larger and more implicit attack surface because their ML components are often shared between mutually distrusting users, and the difference from non-learned versions of the same system is subtle enough to obscure the added risk. This paper analyzes the root causes of that increased attack surface and proposes a framework for identifying vulnerabilities that stem from the use of ML inside a system. The approach combines system security analysis, adversarial machine learning, and side-channel modeling. It highlights three generic threats: (1) data leakage via learned components, (2) poisoning that causes exponential memory blowup, and (3) cross-user timing side channels. The authors report dozens of previously unknown vulnerabilities across state-of-the-art learning-enhanced databases, including DeepDB, SageDB, and LISA, and empirically demonstrate three high-impact attacks: reconstruction of past queries, an index crash within seconds, and cross-user inference of key distributions. These results indicate that ML integration introduces pervasive, severe security risks, challenging the assumption that learned components are merely performance optimizations.
📝 Abstract
A learned system uses machine learning (ML) internally to improve performance. We can expect such systems to be vulnerable to some adversarial-ML attacks. Often, the learned component is shared between mutually distrusting users or processes, much like microarchitectural resources such as caches, potentially giving rise to highly realistic attacker models. However, compared to attacks on other ML-based systems, attackers face a level of indirection, as they cannot interact directly with the learned model. Additionally, the difference between the attack surface of learned and non-learned versions of the same system is often subtle. These factors obscure the de facto risks that the incorporation of ML carries. We analyze the root causes of the potentially increased attack surface in learned systems and develop a framework for identifying vulnerabilities that stem from the use of ML. We apply our framework to a broad set of learned systems under active development. To empirically validate the many vulnerabilities surfaced by our framework, we select three of them and implement and evaluate exploits against prominent learned-system instances. We show that the use of ML caused leakage of past queries in a database, enabled a poisoning attack that causes exponential memory blowup in an index structure and crashes it in seconds, and enabled index users to snoop on each other's key distributions by timing queries over their own keys. We find that adversarial ML is a universal threat against learned systems, point to open research gaps in our understanding of learned-systems security, and conclude by discussing mitigations, while noting that data leakage is inherent in systems whose learned component is shared between multiple parties.
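The cross-user timing channel described in the abstract can be illustrated with a toy sketch. This is not the paper's exploit: `ToyLearnedIndex`, `fit_line`, and `lookup_cost` are hypothetical names, the index is assumed to be a single linear model over all users' sorted keys, and the number of slots between the predicted and true position stands in for lookup latency. The point it shows is only that an attacker's lookup cost on their own keys can depend on how another user's keys are distributed.

```python
from bisect import bisect_left


def fit_line(keys):
    """Least-squares fit of position ~ slope * key + intercept over sorted keys."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    return slope, mean_p - slope * mean_k


class ToyLearnedIndex:
    """Hypothetical shared index: one linear model trained on every user's keys."""

    def __init__(self, keys):
        self.keys = sorted(set(keys))
        self.slope, self.intercept = fit_line(self.keys)

    def lookup_cost(self, key):
        """Proxy for latency: 1 probe plus the distance from the predicted slot."""
        true_pos = bisect_left(self.keys, key)
        guess = round(self.slope * key + self.intercept)
        guess = min(max(guess, 0), len(self.keys) - 1)
        return 1 + abs(true_pos - guess)


# The attacker only ever queries keys they inserted themselves.
attacker_keys = list(range(0, 10000, 100))

# Two possible victim key distributions (assumed for illustration).
victim_uniform = list(range(50, 10000, 100))   # spread evenly over the key space
victim_clustered = list(range(5000, 5100))     # packed into a narrow range

uniform_cost = sum(
    ToyLearnedIndex(attacker_keys + victim_uniform).lookup_cost(k)
    for k in attacker_keys
)
clustered_cost = sum(
    ToyLearnedIndex(attacker_keys + victim_clustered).lookup_cost(k)
    for k in attacker_keys
)

print("attacker cost, uniform victim:  ", uniform_cost)
print("attacker cost, clustered victim:", clustered_cost)
```

In this sketch the clustered victim keys distort the shared model, so the attacker's own lookups need more probes than in the uniform case. A real attack would have to work from noisy wall-clock timings and a far more complex index, which this toy deliberately ignores.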