Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language model (VLM) resource-exhaustion attacks increase inference overhead but often produce semantically anomalous outputs, failing to balance effectiveness and stealth. This paper proposes Hidden Tail, a stealthy resource-exhaustion attack against VLMs. Its core innovation is generating prompt-agnostic adversarial images that induce models to emit excessively long text, up to the maximum token limit, by appending user-invisible special tokens at the output tail that suppress EOS token generation and extend the sequence, all while preserving the semantic naturalness of the visible output. The method employs a dynamically weighted composite loss function that jointly optimizes semantic fidelity, repeated generation of the special token, and EOS suppression. Experiments demonstrate that Hidden Tail produces outputs up to 19.2× longer than baseline methods and is the first attack to combine high-intensity computational resource exhaustion with strong output-level stealth.
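The dynamically weighted composite loss described above can be sketched in miniature. This is an illustrative reconstruction, not the authors' code: the specific weight schedule (a linear shift from semantic fidelity toward the attack terms over the optimization) and the function name are assumptions.

```python
def composite_loss(sem_loss: float, special_tok_loss: float, eos_loss: float,
                   step: int, total_steps: int) -> float:
    """Combine the three objectives (semantic fidelity, repeated special-token
    generation, EOS suppression) with an assumed linear weight schedule:
    early steps emphasize semantic preservation, later steps shift weight
    toward the special-token and EOS-suppression terms.
    """
    progress = step / total_steps          # 0.0 -> 1.0 over the optimization
    w_sem = 1.0 - 0.5 * progress           # decays from 1.0 to 0.5
    w_special = 0.5 + 0.5 * progress       # grows from 0.5 to 1.0
    w_eos = 0.5 + 0.5 * progress           # grows from 0.5 to 1.0
    return w_sem * sem_loss + w_special * special_tok_loss + w_eos * eos_loss
```

In an actual attack loop, the three term values would be computed from the victim VLM's logits each step and the combined scalar back-propagated into the adversarial image pixels.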

📝 Abstract
Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks. Prior attacks attempt to extend VLM output sequences by optimizing adversarial images, thereby increasing inference costs. However, these extended outputs often introduce irrelevant abnormal content, compromising attack stealthiness. This trade-off between effectiveness and stealthiness poses a major limitation for existing attacks. To address this challenge, we propose Hidden Tail, a stealthy resource consumption attack that crafts prompt-agnostic adversarial images, inducing VLMs to generate maximum-length outputs by appending special tokens invisible to users. Our method employs a composite loss function that balances semantic preservation, repetitive special token induction, and suppression of the end-of-sequence (EOS) token, optimized via a dynamic weighting strategy. Extensive experiments show that Hidden Tail outperforms existing attacks, increasing output length by up to 19.2× and reaching the maximum token limit, while preserving attack stealthiness. These results highlight the urgent need to improve the robustness of VLMs against efficiency-oriented adversarial threats. Our code is available at https://github.com/zhangrui4041/Hidden_Tail.
Problem

Research questions and friction points this paper is trying to address.

Stealthy resource consumption attacks on Vision-Language Models
Adversarial images inducing maximum-length invisible outputs
Balancing attack effectiveness with output stealthiness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Crafting prompt-agnostic adversarial images
Appending invisible special tokens to outputs
Using composite loss with dynamic weighting
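The special-token and EOS-related terms in the bullets above can be made concrete with a per-step sketch. The exact loss forms are assumptions (a standard negative log-likelihood on the special token, and a penalty on EOS probability), not the paper's verbatim formulation:

```python
import math

def _softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def token_losses(logits, special_id, eos_id):
    """Illustrative per-step attack terms from next-token logits:
    - special-token loss: -log p(special), driving repeated emission of the
      user-invisible special token at the output tail;
    - EOS loss: -log(1 - p(eos)), suppressing end-of-sequence generation so
      decoding continues to the maximum token limit.
    """
    p = _softmax(logits)
    loss_special = -math.log(p[special_id])
    loss_eos = -math.log(1.0 - p[eos_id])
    return loss_special, loss_eos
```

Both terms decrease as the adversarial image pushes probability mass onto the special token and away from EOS, which is what lengthens the generated sequence without altering the visible text.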
Rui Zhang
University of Electronic Science and Technology of China
Zihan Wang
University of Electronic Science and Technology of China
Tianli Yang
University of Electronic Science and Technology of China
Hongwei Li
University of Electronic Science and Technology of China
Wenbo Jiang
University of Electronic Science and Technology of China
AI security, Backdoor attack
Qingchuan Zhao
City University of Hong Kong
Mobile security, IoT security, Program Analysis, Reverse Engineering
Yang Liu
Nanyang Technological University
Guowen Xu
Professor, SMIEEE, University of Electronic Science and Technology of China
Applied Cryptography, Computer Security, AI Security and Privacy