AI Summary
This work addresses the vulnerability of cumulative distribution function (CDF)-based linear regression models in learned indexes to poisoning attacks, a threat that has so far lacked rigorous theoretical analysis. For the first time, it formally models poisoning attacks targeting CDF-oriented linear regression. Through theoretical analysis and optimization, the study proves that a single-point attack admits a unique optimal form, reveals the suboptimality of greedy multi-point attacks, and derives an upper bound on the performance degradation caused by multi-point attacks. Experimental validation demonstrates that greedy strategies often come close to this theoretical upper bound in practice. The paper thus establishes a foundational theoretical framework and a quantitative methodology for assessing the robustness of learned indexes against data poisoning.
Abstract
Learned indexes are a class of index data structures that enable fast search by approximating the cumulative distribution function (CDF) using machine learning models (Kraska et al., SIGMOD'18). However, recent studies have shown that learned indexes are vulnerable to poisoning attacks, in which injecting a small number of poison keys into the training data can significantly degrade model accuracy and reduce index performance (Kornaropoulos et al., SIGMOD'22). In this work, we provide a rigorous theoretical analysis of poisoning attacks targeting linear regression models over CDFs, one of the most basic regression models and a core component of many learned indexes. Our main contributions are as follows: (i) We present a theoretical proof characterizing the optimal single-point poisoning attack and show that the existing method yields the optimal attack. (ii) We show that in multi-point attacks, the existing greedy approach is not always optimal, and we rigorously derive the key properties that an optimal attack must satisfy. (iii) We propose a method to compute an upper bound on the impact of multi-point poisoning attacks and empirically demonstrate that the loss under the greedy approach is often close to this bound. Our study deepens the theoretical understanding of attack strategies against linear regression models on CDFs and provides a foundation for the theoretical evaluation of attacks and defenses on learned indexes.
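To make the attack setting concrete, the following is a minimal toy sketch (not the paper's formulation or attack algorithm): a learned index fits a linear model mapping keys to ranks, i.e. an approximation of the empirical CDF, and injecting even a single extreme poison key skews the least-squares fit and inflates the prediction error over the legitimate keys. The key values, the loss definition, and the choice of poison point here are illustrative assumptions only.

```python
# Toy illustration (assumed setup, not the paper's exact model): a learned
# index approximates the empirical CDF by regressing rank on key value.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def cdf_loss(keys):
    """MSE of a linear fit of rank (CDF position) against the sorted keys."""
    ks = sorted(keys)
    ranks = list(range(len(ks)))
    a, b = fit_line(ks, ranks)
    return sum((a * k + b - r) ** 2 for k, r in zip(ks, ranks)) / len(ks)

clean = list(range(100))      # uniformly spaced keys: a line fits perfectly
poisoned = clean + [10_000]   # one far-out poison key skews the regression

print(cdf_loss(clean))                       # (near) zero
print(cdf_loss(poisoned) > cdf_loss(clean))  # True: the poison key hurts
```

In a learned index, a larger regression error translates directly into a longer local search around the predicted position, which is why even a few well-chosen poison keys can degrade lookup performance.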