Gradient Structure Estimation under Label-Only Oracles via Spectral Sensitivity

📅 2026-01-17

🤖 AI Summary

This work proposes an efficient gradient estimation framework for hard-label black-box attacks, where only the model's top-1 predicted label is observable. By combining a zero-query frequency-domain initialization with a Pattern-Driven Optimization (PDO) strategy, the method accurately approximates the sign of the true loss gradient under extremely limited query budgets. It is the first to theoretically unify existing sign-flipping hard-label attacks as gradient sign approximation methods, and it provides a principled initialization and optimization mechanism with theoretical guarantees that substantially reduce query complexity. Experiments on CIFAR-10, ImageNet, and ObjectNet show that the approach outperforms state-of-the-art methods, achieving higher attack success rates with fewer queries, completely evading the Blacklight defense (0% detection rate), and generalizing effectively to biomedical images and dense prediction tasks.
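
The quantity behind "approximating the sign of the true loss gradient" is how well a ±1 vector agrees with that sign, and the abstract states its initialization guarantee as an expected cosine similarity to it. The minimal sketch below (all function names are illustrative, not taken from the authors' code) computes this similarity for ±1 vectors, where it reduces to 1 - 2m/d for m disagreeing coordinates out of d, and checks that uniformly random signs average close to 0. The "true" gradient sign is a hypothetical random stand-in, since a real attacker cannot observe it.

```python
import random


def random_sign_vector(d, rng):
    """Uniform random ±1 vector (the 'random baseline' initialization)."""
    return [rng.choice((-1, 1)) for _ in range(d)]


def sign_cosine(a, b):
    """Cosine similarity between two ±1 vectors of equal length d.

    Both have norm sqrt(d), so cos(a, b) = (a . b) / d, which equals
    1 - 2 * m / d when the vectors disagree on m coordinates.
    """
    if not a or len(a) != len(b):
        raise ValueError("vectors must be non-empty and of equal length")
    if any(v not in (-1, 1) for v in a + b):
        raise ValueError("entries must be -1 or +1")
    return sum(x * y for x, y in zip(a, b)) / len(a)


def mean_random_baseline(true_sign, trials, rng):
    """Average cosine of independent random sign guesses against true_sign."""
    d = len(true_sign)
    total = sum(sign_cosine(random_sign_vector(d, rng), true_sign) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    d = 32 * 32 * 3  # a flattened CIFAR-10-sized input
    # Hypothetical stand-in for sign(grad L(x)); a real attacker never sees it.
    true_sign = random_sign_vector(d, rng)
    # Disagree on the first quarter of coordinates: cos = 1 - 2 * (1/4) = 0.5.
    guess = [-v if i < d // 4 else v for i, v in enumerate(true_sign)]
    print(sign_cosine(guess, true_sign))  # 0.5
    print(mean_random_baseline(true_sign, 200, rng))  # close to 0
```

Any initialization that beats the random baseline must therefore push this average above 0; every misaligned coordinate costs exactly 2/d.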

📝 Abstract

Hard-label black-box settings, where only top-1 predicted labels are observable, pose a fundamentally constrained yet practically important feedback model for understanding model behavior. A central challenge in this regime is whether meaningful gradient information can be recovered from such discrete responses. In this work, we develop a unified theoretical perspective showing that a wide range of existing sign-flipping hard-label attacks can be interpreted as implicitly approximating the sign of the true loss gradient. This observation reframes hard-label attacks from heuristic search procedures into instances of gradient sign recovery under extremely limited feedback. Motivated by this first-principles understanding, we propose a new attack framework that combines a zero-query frequency-domain initialization with a Pattern-Driven Optimization (PDO) strategy. We establish theoretical guarantees showing that, under mild assumptions, our initialization achieves a higher expected cosine similarity to the true gradient sign than random baselines, while PDO attains substantially lower query complexity than existing structured search approaches. We validate the framework empirically through extensive experiments on CIFAR-10, ImageNet, and ObjectNet, covering standard and adversarially trained models, commercial APIs, and CLIP-based models. The results show that our method consistently surpasses state-of-the-art hard-label attacks in both attack success rate and query efficiency, particularly in low-query regimes. Beyond image classification, our approach generalizes effectively to corrupted data, biomedical datasets, and dense prediction tasks. Notably, it also circumvents Blacklight, a state-of-the-art stateful defense, yielding a 0% detection rate. Our code will be released publicly at https://github.com/csjunjun/DPAttack.git.
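
To make the gradient-sign interpretation concrete, the sketch below runs a generic hierarchical block sign-flipping search against a hypothetical label-only oracle. It is not the paper's PDO procedure. The model is an assumed toy linear classifier, so the gradient of its margin is simply `w`; along a ±1 direction `s` with w·s > 0, the boundary radius is -(w·x0 + b)/(w·s), so any flip that shrinks the radius (measured with label queries and binary search) moves `s` toward sign(w). All class and function names are illustrative.

```python
import random


class LinearLabelOracle:
    """Hypothetical toy model that reveals only its top-1 label (0 or 1).

    label(x) = 1 if w . x + b > 0 else 0, so the gradient of the margin is w.
    Every call counts as one query.
    """

    def __init__(self, w, b):
        self.w, self.b, self.queries = w, b, 0

    def __call__(self, x):
        self.queries += 1
        return int(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b > 0)


def step(x0, s, r):
    """The point x0 + r * s."""
    return [xi + r * si for xi, si in zip(x0, s)]


def boundary_radius(oracle, x0, y, s, r_hi, tol=1e-6):
    """Smallest r in (0, r_hi] (up to tol) with oracle(x0 + r * s) != y.

    Returns None if the label is still y at r_hi. Uses label queries only.
    """
    if oracle(step(x0, s, r_hi)) == y:
        return None  # one query: no label change at r_hi, so reject
    lo, hi = 0.0, r_hi  # invariant: the label at hi differs from y
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if oracle(step(x0, s, mid)) == y:
            lo = mid
        else:
            hi = mid
    return hi


def sign_flip_search(oracle, x0, y, s, r_max=100.0, levels=None, tol=1e-6):
    """Hierarchical block sign flipping over a ±1 direction s.

    Level l splits the d coordinates into 2**l blocks; a block flip is kept
    only if it shrinks the boundary radius. By default the levels go down to
    single-coordinate blocks. Returns (s, radius or None).
    """
    d = len(x0)
    if levels is None:
        levels = (d - 1).bit_length() + 1
    r_best = boundary_radius(oracle, x0, y, s, r_max, tol)
    for level in range(levels):
        size = -(-d // min(2 ** level, d))  # ceiling division
        for start in range(0, d, size):
            cand = s[:start] + [-v for v in s[start:start + size]] + s[start + size:]
            limit = r_max if r_best is None else r_best
            r = boundary_radius(oracle, x0, y, cand, limit, tol)
            if r is not None and (r_best is None or r < r_best):
                s, r_best = cand, r
    return s, r_best


if __name__ == "__main__":
    rng = random.Random(0)
    d = 64
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    oracle = LinearLabelOracle(w, b=-5.0)
    x0 = [0.0] * d  # margin is -5, so the clean label is 0
    y = oracle(x0)
    true_sign = [1 if wi > 0 else -1 for wi in w]  # unknown to the attacker

    def cos_to_true(a):
        return sum(p * q for p, q in zip(a, true_sign)) / d

    s0 = [rng.choice((-1, 1)) for _ in range(d)]
    s, r = sign_flip_search(oracle, x0, y, s0)
    print("cosine to sign(w): before", cos_to_true(s0), "after", cos_to_true(s))
    print("boundary radius:", r, "queries:", oracle.queries)
```

Design note: each candidate is first probed once at the current best radius, and a binary search is spent only if the label already flips there, so rejected flips cost a single query each.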

Problem

Research questions and friction points this paper is trying to address.

hard-label black-box attack

gradient estimation

label-only oracle

query efficiency

adversarial attack

Innovation

Methods, ideas, or system contributions that make the work stand out.

hard-label black-box attack

gradient sign recovery

frequency-domain initialization
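
The abstract does not specify how the frequency-domain initialization is constructed. One plausible zero-query construction, assumed here in the spirit of low-frequency perturbation methods and not taken from the paper, is to draw random coefficients for only the lowest 2-D DCT frequencies, synthesize the spatial pattern, and keep its sign. The pure-Python sketch below does exactly that; `low_frequency_sign_init` and its parameters (`n`, `k`, `channels`) are hypothetical names.

```python
import math
import random


def dct_basis_vector(n, u):
    """Orthonormal 1-D DCT-II basis vector of frequency u and length n."""
    scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
    return [scale * math.cos(math.pi * (2 * i + 1) * u / (2 * n)) for i in range(n)]


def low_frequency_sign_init(n, k, channels=3, rng=None):
    """Zero-query ±1 initialization whose pattern uses only low frequencies.

    Draws Gaussian coefficients for the k x k lowest 2-D DCT frequencies,
    synthesizes the n x n spatial pattern (an inverse DCT restricted to those
    frequencies) and keeps its sign. Repeating the same pattern for every
    channel is an assumption. Returns a flat list of length
    channels * n * n in channel-major (C, H, W) order.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if rng is None:
        rng = random.Random()
    bases = [dct_basis_vector(n, u) for u in range(k)]
    coeffs = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(k)]
    pattern = []
    for i in range(n):  # row index
        for j in range(n):  # column index
            val = sum(coeffs[u][v] * bases[u][i] * bases[v][j]
                      for u in range(k) for v in range(k))
            pattern.append(1 if val >= 0 else -1)
    return pattern * channels


if __name__ == "__main__":
    n = 16
    s = low_frequency_sign_init(n, k=3, channels=1, rng=random.Random(0))
    for i in range(n):  # print the spatial sign pattern
        print("".join("+" if v > 0 else "-" for v in s[i * n:(i + 1) * n]))
```

Each row of the synthesized pattern is a polynomial of degree at most k - 1 in cos θ, so it changes sign at most k - 1 times per row; this is what makes the pattern smooth compared with a uniformly random sign vector, which flips on roughly half of all adjacent pairs.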