🤖 AI Summary
This work addresses poisoning attacks against ridge regression models that use both numerical and categorical features. The authors propose a poisoning attack that explicitly models and manipulates categorical variables. The one-hot encoding constraints inherent to categorical features are captured with Special Ordered Sets of type 1 (SOS-1), and attack design is formulated as a bilevel optimization problem with a nonconvex mixed-integer upper level and a convex quadratic lower level. Applying the Karush-Kuhn-Tucker (KKT) conditions to the lower level yields a single-level reformulation; bounds on the lower-level variables, pruning, and tailored integer programming techniques further improve solver performance. Experiments on benchmark datasets show that the method raises the victim model's mean squared error (MSE) above existing poisoning baselines, providing a stronger tool for evaluating and improving the robustness of models with categorical features.
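The bilevel structure described above can be sketched schematically as follows; the symbols here are illustrative and the paper's exact notation may differ. Writing $X_p, y_p$ for the poison samples and $\tilde{X}, \tilde{y}$ for the training set augmented with them, the attacker solves

$$
\max_{X_p,\,y_p}\;\; \frac{1}{m}\,\bigl\lVert X_{\mathrm{val}}\, w^{*} - y_{\mathrm{val}} \bigr\rVert^{2}
\quad \text{s.t.} \quad
w^{*} \in \arg\min_{w}\; \bigl\lVert \tilde{X} w - \tilde{y} \bigr\rVert^{2} + \lambda \lVert w \rVert^{2},
$$

where each poisoned categorical feature's one-hot block $z$ obeys the SOS-1 condition $\sum_j z_j = 1,\; z_j \in \{0,1\}$. Because the lower level is an unconstrained convex quadratic, its KKT conditions reduce to stationarity,

$$
\bigl(\tilde{X}^{\top}\tilde{X} + \lambda I\bigr)\, w^{*} = \tilde{X}^{\top}\tilde{y},
$$

and substituting this linear system for the $\arg\min$ yields the single-level mixed-integer reformulation.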
📝 Abstract
Machine Learning (ML) models have become a powerful tool to extract information from large datasets and use it to make accurate predictions and automated decisions. However, ML models can be vulnerable to external attacks, causing them to underperform or deviate from their expected tasks. One way to attack ML models is by injecting malicious data to mislead the algorithm during the training phase, which is referred to as a poisoning attack. We can prepare for such situations by designing anticipated attacks, which are later used for creating and testing defence strategies. In this paper, we propose an algorithm to generate strong poisoning attacks for a ridge regression model containing both numerical and categorical features, one that explicitly models and poisons the categorical features. We model categorical features as SOS-1 sets and formulate the design of poisoning attacks as a bilevel optimization problem that is nonconvex mixed-integer in the upper level and unconstrained convex quadratic in the lower level. We present the mathematical formulation of the problem, introduce a single-level reformulation based on the Karush-Kuhn-Tucker (KKT) conditions of the lower level, derive bounds for the lower-level variables to accelerate solver performance, and propose a new algorithm to poison categorical features. Numerical experiments show that our method increases the mean squared error of the attacked model on all datasets compared to the previous benchmark in the literature.
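As a minimal illustration of the attack setting (not the paper's algorithm), the sketch below poisons a toy ridge regression with one categorical feature by brute-force enumeration: the lower level is solved via its KKT stationarity system, the poison point's one-hot block ranges over the SOS-1-feasible encodings, and the candidate that most increases clean-data MSE is kept. All names (`X_tr`, `lam`, the label grid `{-5, 5}`) are illustrative assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def ridge_fit(X, y, lam):
    # Lower-level KKT stationarity: (X^T X + lam*I) w = X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Toy data: 2 numerical features + one 3-level categorical feature,
# one-hot encoded so each row has exactly one 1 in the block (SOS-1).
n, d_num, k = 40, 2, 3
X_num = rng.normal(size=(n, d_num))
X_cat = np.eye(k)[rng.integers(0, k, size=n)]
X_tr = np.hstack([X_num, X_cat])
w_true = rng.normal(size=d_num + k)
y_tr = X_tr @ w_true + 0.1 * rng.normal(size=n)

lam = 0.1
w_clean = ridge_fit(X_tr, y_tr, lam)
base_mse = mse(X_tr, y_tr, w_clean)

# Brute-force search over one poison point: numerical part fixed for
# illustration; the categorical block ranges over the k feasible one-hot
# encodings (the SOS-1 set); the label over an extreme grid.
best_mse, best_choice = base_mse, None
for j, y_p in product(range(k), (-5.0, 5.0)):
    x_p = np.concatenate([np.ones(d_num), np.eye(k)[j]])
    X_pois = np.vstack([X_tr, x_p])
    y_pois = np.append(y_tr, y_p)
    w_p = ridge_fit(X_pois, y_pois, lam)
    m = mse(X_tr, y_tr, w_p)  # attacker's objective: MSE on clean data
    if m > best_mse:
        best_mse, best_choice = m, (j, y_p)

print(f"clean MSE {base_mse:.4f} -> poisoned MSE {best_mse:.4f}")
```

The paper replaces this enumeration with a single-level mixed-integer program, which scales to many poison points and jointly optimizes the numerical features; the sketch only conveys the objective and the SOS-1 feasible set.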