From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models Without Task Knowledge

πŸ“… 2025-11-10
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study uncovers a novel adversarial threat against open-source video foundation models (VFMs) under zero-task-knowledge conditions—i.e., without access to downstream tasks, training data, model architectures, or query interfaces. To exploit this threat, we propose the Transferable Video Attack (TVA), a temporal-aware attack framework that integrates bidirectional contrastive learning with a temporal consistency loss to generate cross-task transferable adversarial videos using only pretrained VFMs. Experiments across 24 diverse video understanding tasks demonstrate that TVA significantly outperforms conventional transfer-based attacks, achieving high attack success rates against both downstream models and multimodal large language models (MLLMs) fine-tuned from these VFMs. Crucially, our results provide the first empirical evidence that pretrained VFM representations themselves exhibit inherent and substantial adversarial vulnerability. This work establishes a new paradigm for security assessment in the open video model ecosystem.

πŸ“ Abstract
Large-scale Video Foundation Models (VFMs) have significantly advanced various video-related tasks, whether through task-specific models or Multi-modal Large Language Models (MLLMs). However, the open accessibility of VFMs also introduces critical security risks, as adversaries can exploit full knowledge of the VFMs to launch potent attacks. This paper investigates a novel and practical adversarial threat scenario: attacking downstream models or MLLMs fine-tuned from open-source VFMs, without requiring access to the victim task, training data, model queries, or architecture. In contrast to conventional transfer-based attacks that rely on task-aligned surrogate models, we demonstrate that adversarial vulnerabilities can be exploited directly from the VFMs. To this end, we propose the Transferable Video Attack (TVA), a temporal-aware adversarial attack method that leverages the temporal representation dynamics of VFMs to craft effective perturbations. TVA integrates a bidirectional contrastive learning mechanism to maximize the discrepancy between clean and adversarial features, and introduces a temporal consistency loss that exploits motion cues to enhance the sequential impact of perturbations. TVA avoids the need to train expensive surrogate models or to access domain-specific data, thereby offering a more practical and efficient attack strategy. Extensive experiments across 24 video-related tasks demonstrate the efficacy of TVA against downstream models and MLLMs, revealing a previously underexplored security vulnerability in the deployment of video models.
Problem

Research questions and friction points this paper is trying to address.

Investigating adversarial attacks on video foundation models without victim task knowledge
Proposing temporal-aware attack method exploiting video representation dynamics
Revealing security vulnerabilities in downstream video models and MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

TVA leverages temporal representation dynamics for attacks
Uses bidirectional contrastive learning for feature discrepancy
Introduces temporal consistency loss exploiting motion cues
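The paper does not publish its implementation here, but the two loss terms named above can be sketched in a minimal NumPy toy. Everything below is an assumption for illustration only: the function names, the cosine-similarity formulation of the bidirectional contrastive term, and the weighting `lam` are not taken from the paper.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def bidirectional_contrastive_loss(clean_feats, adv_feats):
    """Average cosine similarity between every clean/adversarial frame-feature
    pair (both directions). The attacker MINIMIZES this, i.e. maximizes the
    clean-vs-adversarial feature discrepancy.
    clean_feats, adv_feats: (T, D) arrays of per-frame VFM features."""
    sims = [cosine_sim(c, a) for c in clean_feats for a in adv_feats]
    return float(np.mean(sims))

def temporal_consistency_loss(adv_feats):
    """Mean L2 distance between consecutive adversarial frame features,
    a stand-in for the motion-cue term that keeps the perturbation's
    effect coherent across the frame sequence."""
    diffs = adv_feats[1:] - adv_feats[:-1]
    return float(np.mean(np.linalg.norm(diffs, axis=1)))

def tva_objective(clean_feats, adv_feats, lam=0.1):
    # Hypothetical joint objective minimized w.r.t. the video perturbation;
    # the trade-off weight `lam` is an arbitrary illustrative choice.
    return (bidirectional_contrastive_loss(clean_feats, adv_feats)
            + lam * temporal_consistency_loss(adv_feats))
```

In this toy, flipping the sign of the clean features (a maximally "pushed-away" adversarial representation) yields a lower objective than leaving them unchanged, which is the direction a gradient-based attacker would optimize the perturbation in.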
πŸ”Ž Similar Papers
No similar papers found.
Hui Lu
Department of Computer Science and Engineering (CSE), The University of Texas at Arlington (UTA)
Cloud Computing · Virtualization · File and Storage Systems · Computer Networks · Computer Systems

Yi Yu
ROSE Lab, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

Song Xia
Nanyang Technological University (NTU)
Machine Learning

Yiming Yang
Nanyang Technological University, Singapore

Deepu Rajan
Nanyang Technological University
Image Processing · Computer Vision

Boon Poh Ng
Nanyang Technological University, Singapore

Alex Kot
Nanyang Technological University
Signal Processing & Machine Learning

Xudong Jiang
ROSE Lab, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore