See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work exposes adversarial vulnerabilities in referring-expression-based multi-object tracking (RMOT) systems, specifically within their language-vision grounding and target-association modules. To exploit these weaknesses, particularly in state-of-the-art end-to-end RMOT models that rely on FIFO memory and Transformer-based spatiotemporal reasoning, the authors propose VEIL, the first consistency-aware adversarial attack framework for joint language-vision modeling, supporting both digital and physical-domain perturbations. VEIL uniquely exploits the temporal persistence of FIFO memory to craft cross-frame, temporally consistent perturbations that induce ID switches and track fragmentation. Evaluated on the Refer-KITTI benchmark, VEIL substantially degrades MOTA and IDF1 scores, revealing critical security risks for RMOT in real-world deployment. The study establishes a new perspective and a standardized adversarial benchmark for robustness evaluation of multimodal tracking systems.
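The persistence effect the summary attributes to FIFO memory can be illustrated with a toy history buffer. This is an illustrative sketch only, not the paper's model: the function name, string-valued frames, and buffer semantics are assumptions made for demonstration.

```python
from collections import deque

def frames_influenced(buffer_len, poisoned_t, horizon):
    """Count how many association steps still see a poisoned entry
    after a single corrupted frame enters a FIFO history buffer."""
    memory = deque(maxlen=buffer_len)  # FIFO: oldest entry evicted first
    influenced = 0
    for t in range(horizon):
        memory.append("poisoned" if t == poisoned_t else "clean")
        # Any matching step that reads the buffer while the poisoned
        # entry is still inside is affected.
        if "poisoned" in memory:
            influenced += 1
    return influenced

# A single corrupted frame keeps affecting association for
# buffer_len subsequent steps before it is evicted.
print(frames_influenced(buffer_len=4, poisoned_t=0, horizon=10))  # → 4
```

This is why a one-frame perturbation can cause errors that "persist within the history buffer over multiple subsequent frames," as the abstract puts it: the corrupted representation keeps participating in matching until the FIFO evicts it.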

📝 Abstract
Language-vision understanding has driven the development of advanced perception systems, most notably the emerging paradigm of Referring Multi-Object Tracking (RMOT). By leveraging natural-language queries, RMOT systems can selectively track objects that satisfy a given semantic description, guided by Transformer-based spatial-temporal reasoning modules. End-to-End (E2E) RMOT models further unify feature extraction, temporal memory, and spatial reasoning within a Transformer backbone, enabling long-range spatial-temporal modeling over fused textual-visual representations. Despite these advances, the reliability and robustness of RMOT remain underexplored. In this paper, we examine the security implications of RMOT systems from a design-logic perspective, identifying adversarial vulnerabilities that compromise both the linguistic-visual referring and track-object matching components. Additionally, we uncover a novel vulnerability in advanced RMOT models employing FIFO-based memory, whereby targeted and consistent attacks on their spatial-temporal reasoning introduce errors that persist within the history buffer over multiple subsequent frames. We present VEIL, a novel adversarial framework designed to disrupt the unified referring-matching mechanisms of RMOT models. We show that carefully crafted digital and physical perturbations can corrupt the reliability of the tracking logic, inducing track ID switches and terminations. We conduct comprehensive evaluations on the Refer-KITTI dataset to validate the effectiveness of VEIL and demonstrate the urgent need for security-aware RMOT designs in critical large-scale applications.
Problem

Research questions and friction points this paper is trying to address.

Adversarial attacks compromise linguistic-visual association in tracking systems
Vulnerabilities in spatial-temporal reasoning persist across multiple frames
Perturbations disrupt track-object matching, causing ID switches and track terminations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial framework attacking linguistic-visual association
Targets Transformer-based spatial-temporal reasoning modules
Exploits FIFO memory vulnerabilities in tracking systems
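The "cross-frame temporally consistent" perturbation idea above can be sketched as a shared perturbation optimized over a window of frames, in the style of projected gradient descent. This is a minimal sketch under stated assumptions: `grad_fn` is a hypothetical differentiable surrogate for the tracker's matching loss, and the budget/step values are illustrative; the paper's actual VEIL optimization is not reproduced here.

```python
import numpy as np

def temporally_consistent_perturbation(frames, grad_fn,
                                       eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD-style sketch: optimize ONE perturbation shared by all frames
    in a window, so the corrupted features remain consistent as frames
    flow through a FIFO history buffer.

    frames  : list of same-shaped numpy arrays (the attack window)
    grad_fn : hypothetical surrogate returning the gradient of the
              matching loss w.r.t. a perturbed frame
    eps     : L-infinity perturbation budget
    alpha   : step size per iteration
    """
    delta = np.zeros_like(frames[0])
    for _ in range(steps):
        # Average gradients across the window so the same delta is
        # effective for every frame still held in the buffer.
        g = np.mean([grad_fn(f + delta) for f in frames], axis=0)
        delta = delta + alpha * np.sign(g)      # signed ascent step
        delta = np.clip(delta, -eps, eps)       # project onto L-inf ball
    return delta
```

Sharing a single `delta` across the window is what distinguishes this from independent per-frame attacks: every buffered frame carries the same consistent corruption, which is the property the summary credits with inducing persistent ID switches rather than one-frame glitches.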
Halima Bouzidi
Postdoctoral Scholar, University of California Irvine
Machine Learning · Efficiency · Security · Privacy
Haoyu Liu
University of California, Irvine
Mohammad Al Faruque
University of California, Irvine