🤖 AI Summary
This work addresses the restriction on the number of random directions in zeroth-order hard-thresholding algorithms, which arises from the inherent conflict between the error of zeroth-order gradient estimates and the expansiveness of the hard-thresholding operator. To resolve this conflict, we propose a generalized variance-reduced zeroth-order hard-thresholding algorithm in which variance reduction mitigates the tension between gradient estimation error and the hard-thresholding step. Theoretically, the method removes the restriction on the number of random directions and achieves an improved convergence rate, supported by a generalized convergence analysis under standard assumptions. Empirical evaluations on ridge regression and black-box adversarial attack tasks illustrate the utility of the proposed approach.
📝 Abstract
Hard-thresholding is an important class of algorithms in machine learning for solving $\ell_0$-constrained optimization problems. However, the true gradient of the objective function can be difficult to access in certain scenarios; in such cases it is typically approximated by zeroth-order (ZO) methods. SZOHT is so far the only algorithm that tackles $\ell_0$ sparsity constraints with ZO gradients. Unfortunately, SZOHT has a notable limitation on the number of random directions in ZO gradients, due to the inherent conflict between the deviation of ZO gradients and the expansivity of the hard-thresholding operator. This paper approaches the problem by considering the role of variance and provides a new insight into variance reduction: it mitigates the unique conflict between ZO gradients and hard-thresholding. From this perspective, we propose a generalized variance-reduced ZO hard-thresholding algorithm, together with a generalized convergence analysis under standard assumptions. The theoretical results demonstrate that the new algorithm eliminates the restrictions on the number of random directions, leading to improved convergence rates and broader applicability compared with SZOHT. Finally, we illustrate the utility of our method on a ridge regression problem as well as black-box adversarial attacks.
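To make the ingredients named in the abstract concrete, the following is a minimal sketch and not the paper's algorithm. It combines a hard-thresholding operator, a random-direction ZO gradient estimator, and an SVRG-style correction. The function names, the unit-sphere sampling, the forward-difference estimator, the SVRG-style update, and all parameter values are assumptions chosen for illustration; the abstract does not specify any of them.

```python
import math
import random


def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the rest to zero."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def random_directions(d, q, rng):
    """Draw q directions uniformly from the unit sphere in R^d."""
    dirs = []
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in u))
        dirs.append([v / norm for v in u])
    return dirs


def zo_gradient(f, x, dirs, mu):
    """Estimate the gradient of f at x from function values only.

    Averages forward differences along the given directions; the factor d
    rescales the average so that it approximates the full gradient.
    """
    d, q = len(x), len(dirs)
    fx = f(x)
    grad = [0.0] * d
    for u in dirs:
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (f(shifted) - fx) / mu
        for i in range(d):
            grad[i] += d * slope * u[i] / q
    return grad


def vr_zo_hard_threshold(components, x0, k, q, mu, step, epochs, inner, rng):
    """Hypothetical SVRG-style variance-reduced ZO hard-thresholding loop."""
    n, d = len(components), len(x0)

    def full(z):
        return sum(f(z) for f in components) / n

    x = hard_threshold(x0, k)
    for _ in range(epochs):
        snap = list(x)
        # ZO gradient of the full objective at the snapshot point.
        g_snap = zo_gradient(full, snap, random_directions(d, q, rng), mu)
        for _ in range(inner):
            f_i = components[rng.randrange(n)]
            # Both estimates share the same directions (an assumption here).
            dirs = random_directions(d, q, rng)
            g_x = zo_gradient(f_i, x, dirs, mu)
            g_s = zo_gradient(f_i, snap, dirs, mu)
            g = [a - b + c for a, b, c in zip(g_x, g_s, g_snap)]
            x = hard_threshold([xi - step * gi for xi, gi in zip(x, g)], k)
    return x


if __name__ == "__main__":
    # Toy problem (illustrative only): recover a 2-sparse target in R^5.
    target = [2.0, 0.0, -1.5, 0.0, 0.0]
    comps = [lambda z, i=i: (z[i] - target[i]) ** 2 for i in range(5)]
    x_hat = vr_zo_hard_threshold(comps, [0.0] * 5, k=2, q=5, mu=1e-3,
                                 step=0.05, epochs=20, inner=10,
                                 rng=random.Random(0))
    print(x_hat)
```

In this sketch `q` plays the role of the number of random directions discussed above. The estimates at the current iterate and at the snapshot share one set of directions so that their difference shrinks as the iterate approaches the snapshot, which is the usual motivation for this kind of correction.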