🤖 AI Summary
This work addresses the lack of targeted attacks on pre-trained encoders when the downstream task is unknown. It proposes, for the first time, a targeted downstream-agnostic attack under a strict threat model by introducing a "threat image" as a feature anchor. A generator crafts sample-specific adversarial perturbations that compel the encoder to produce features matching those of the threat image. This sample-specific design avoids the main weakness of conventional shared-perturbation strategies, where a single perturbation often fails across diverse images, and yields a task-agnostic, feature-level attack framework. Extensive experiments on three benchmark datasets and ten self-supervised learning methods demonstrate the method's effectiveness in terms of attack success rate and imperceptibility, exposing critical security vulnerabilities in pre-trained encoders.
📝 Abstract
Recently, pre-trained encoders have gained widespread use due to their strong capability in representation extraction. However, they are vulnerable to downstream-agnostic attacks (DAAs). Existing DAA methods operate under a permissive threat model, in which an attack counts as successful if the generated downstream-agnostic adversarial examples (DAEs) change the original prediction, without requiring a specific target. In this paper, we propose a Targeted DAA (TDAA) method under a stricter threat model that requires the attack to be both targeted and downstream-agnostic. Since the downstream task is unknown and encoders do not directly produce predictions, achieving a targeted attack is particularly challenging. To address this, we introduce a novel component termed the 'threat image', which the attacker pre-selects as the target. Specifically, a generator is designed to produce example-specific adversarial perturbations that compel the victim encoder to output identical features for both the DAEs and the threat image. Previous DAA methods generate a single shared perturbation for all samples, which often fails due to image diversity. Our method instead adopts an example-specific paradigm, generating a tailored perturbation for each image to ensure a high attack success rate and invisibility. By leveraging the threat image as a feature-level anchor, our method builds a task-agnostic bridge to reveal the vulnerabilities of the victim encoder. Extensive experiments on 10 self-supervised methods across 3 benchmark datasets demonstrate the effectiveness of our approach and reveal the pronounced vulnerability of pre-trained encoders. The code will be made publicly available after the review period.
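The feature-alignment idea described in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation. The one-layer `encoder` and `generator`, the cosine-based `alignment_loss`, and the `EPSILON` budget are all hypothetical stand-ins chosen for illustration, since the abstract specifies neither an architecture, a loss, nor a perturbation bound. It only shows the shape of the objective: a perturbation that depends on each input, bounded in magnitude, scored by how closely the encoder's features for the perturbed input match its features for the threat image.

```python
import math
import random

# Assumed L-infinity budget; the abstract does not state a value.
EPSILON = 8 / 255


def encoder(x, weights):
    """Toy stand-in for a frozen pre-trained encoder: one tanh layer."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def generator(x, params):
    """Toy example-specific generator: the perturbation depends on x itself."""
    # tanh keeps each raw output in [-1, 1]; scaling bounds |delta_i| by EPSILON.
    return [EPSILON * math.tanh(sum(p * v for p, v in zip(row, x))) for row in params]


def cosine_similarity(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + 1e-12)


def alignment_loss(x, threat, params, weights):
    """1 - cos(f(x + delta), f(threat)); near 0 when the features align.

    Cosine distance is an assumption; the abstract only says the features
    of the adversarial example and the threat image should be identical.
    """
    delta = generator(x, params)
    # Keep the adversarial example a valid image with pixels in [0, 1].
    x_adv = [min(1.0, max(0.0, v + d)) for v, d in zip(x, delta)]
    return 1.0 - cosine_similarity(encoder(x_adv, weights), encoder(threat, weights))


if __name__ == "__main__":
    rng = random.Random(0)
    dim, feat = 6, 4
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(feat)]
    params = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    image = [rng.random() for _ in range(dim)]
    threat_image = [rng.random() for _ in range(dim)]
    print("alignment loss:", alignment_loss(image, threat_image, params, weights))
```

In a real attack the generator parameters would be trained to minimize this kind of loss over many images while the encoder stays frozen. Because the loss is computed purely on encoder features, it needs no knowledge of the downstream classifier, which is what makes the objective downstream-agnostic.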