🤖 AI Summary
Time-series anomaly detection (TSAD) faces challenges including limited visual-temporal reasoning in numerical modeling, poor generalization, and weak interpretability. This paper introduces vision-language models (VLMs) to TSAD for the first time, proposing a two-stage cross-modal framework: ViT4TS localizes candidate anomalies via a 2-D time-series image representation, while VLM4TS refines detections at the semantic level by integrating global temporal context. Key contributions include: (1) a VLM-driven TSAD paradigm that eliminates the need for any time-series training; (2) a lightweight pretrained ViT encoder for vision screening paired with a frozen VLM for refinement; and (3) multimodal prompt engineering coupled with temporal-visual alignment reasoning. Experiments show a 24.6% improvement in F1-max over the best baseline, 36× higher token efficiency than existing language-model-based methods, and strong zero-shot transfer performance across healthcare, finance, and industrial domains.
📝 Abstract
Time-series anomaly detection (TSAD) plays a vital role in a variety of fields, including healthcare, finance, and industrial monitoring. Prior methods, which mainly focus on training domain-specific models on numerical data, lack the visual-temporal reasoning capacity that human experts use to identify contextual anomalies. To fill this gap, we explore a solution based on vision-language models (VLMs). Recent studies have demonstrated the visual reasoning ability of VLMs, yet their direct application to time series has fallen short on both accuracy and efficiency. To harness the power of VLMs for TSAD, we propose a two-stage solution: (1) ViT4TS, a vision-screening stage built on a relatively lightweight pretrained vision encoder, which leverages 2-D time-series representations to accurately localize candidate anomalies; and (2) VLM4TS, a VLM-based stage that integrates global temporal context and VLM reasoning capacity to refine the detections produced by ViT4TS. We show that without any time-series training, VLM4TS outperforms time-series pretrained and from-scratch baselines in most cases, yielding a 24.6% improvement in F1-max score over the best baseline. Moreover, VLM4TS consistently outperforms existing language-model-based TSAD methods and is, on average, 36 times more efficient in token usage.
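To make the two-stage control flow concrete, below is a minimal Python sketch, not the paper's implementation. The helpers `render_window`, `vit_anomaly_score`, and `vlm_refine` are hypothetical placeholders standing in for the 2-D time-series rendering, the lightweight pretrained ViT screener (ViT4TS), and the frozen-VLM refinement with global context (VLM4TS); the toy scoring rules exist only so the sketch runs end to end.

```python
# Minimal sketch of a ViT4TS -> VLM4TS style pipeline, under the assumptions
# stated above. Only the control flow (render sliding windows as 2-D images,
# screen them for candidates, then refine the top candidates against global
# context) follows the description; the scoring functions are toy stand-ins.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering of time-series windows to images
import matplotlib.pyplot as plt


def render_window(window: np.ndarray) -> np.ndarray:
    """Render a 1-D time-series window as a 2-D RGB image array
    (the '2-D time-series representation' fed to the vision encoder)."""
    fig, ax = plt.subplots(figsize=(2, 2), dpi=64)
    ax.plot(window, linewidth=1)
    ax.axis("off")
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3]
    plt.close(fig)
    return img


def vit_anomaly_score(image: np.ndarray) -> float:
    """Hypothetical placeholder for the lightweight pretrained ViT encoder.
    Here a toy pixel-variance proxy; the real stage scores rendered images."""
    return float(image.std())


def vlm_refine(series: np.ndarray, start: int, end: int) -> bool:
    """Hypothetical placeholder for the frozen-VLM stage: in the paper a
    multimodal prompt with the candidate image plus global temporal context
    asks the VLM to confirm or reject. Here a toy variance-ratio rule."""
    global_std = series.std() + 1e-8
    return bool(series[start:end].std() > 1.5 * global_std)


def detect(series: np.ndarray, win: int = 64, stride: int = 32, top_k: int = 5):
    # Stage 1 (screening): score sliding windows via their image renderings.
    spans = [(s, s + win) for s in range(0, len(series) - win + 1, stride)]
    scores = [vit_anomaly_score(render_window(series[a:b])) for a, b in spans]
    candidates = sorted(zip(scores, spans), reverse=True)[:top_k]
    # Stage 2 (refinement): keep only candidates the "VLM" confirms
    # against the full series as global context.
    return [span for _, span in candidates if vlm_refine(series, *span)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.sin(np.linspace(0, 40, 1024)) + 0.1 * rng.standard_normal(1024)
    x[500:520] += 3.0  # injected level-shift anomaly
    print(detect(x))   # spans flagged by screening and kept after refinement
```

The split mirrors the paper's efficiency argument: the cheap vision stage touches every window, while the expensive VLM call is reserved for a handful of candidates, which is where the reported token savings over whole-series language-model prompting would come from.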