LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language model (VLM)-based video anomaly detection methods typically run inference without structured temporal context, which leads to fragmented predictions and poor interpretability. To address this, the authors propose LATERN, a framework that reframes anomaly detection as a temporal evidence aggregation process. LATERN combines a Context-Aware Anomaly Scoring (CEA) module with a Recursive Evidence Aggregation (REA) module to enable training-free, test-time reasoning on frozen VLMs. Within CEA, an image-grounded memory mechanism selects diverse, text-aligned historical frames as extended context for scoring; REA then aggregates the resulting scores into temporally coherent, interpretable event-level decisions. Experiments on the UCF-Crime and XD-Violence benchmarks show that LATERN improves both detection accuracy and explanation consistency.
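To make the idea of aggregating per-segment scores into event-level intervals concrete, here is a minimal sketch. It is not the paper's REA procedure, whose details are not given here; it is a generic recursive bisection over a list of anomaly scores. The function name `aggregate_intervals`, the threshold `thr`, and the sample scores are all hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open segment range [start, end)


def aggregate_intervals(scores: List[float], lo: int, hi: int,
                        thr: float = 0.5) -> List[Interval]:
    """Recursively bisect [lo, hi) until each piece is uniformly
    above or below `thr`, then merge touching anomalous pieces.

    Illustrative assumption only: the real REA module may aggregate
    evidence very differently (e.g. by querying the VLM at each step).
    """
    if lo >= hi:
        return []
    window = scores[lo:hi]
    if min(window) >= thr:      # whole window is anomalous
        return [(lo, hi)]
    if max(window) < thr:       # whole window is normal
        return []
    mid = (lo + hi) // 2        # mixed evidence: split and recurse
    left = aggregate_intervals(scores, lo, mid, thr)
    right = aggregate_intervals(scores, mid, hi, thr)
    # Merge intervals that meet exactly at the split point.
    if left and right and left[-1][1] == right[0][0]:
        merged = (left[-1][0], right[0][1])
        return left[:-1] + [merged] + right[1:]
    return left + right


# Hypothetical per-segment anomaly scores.
segment_scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.1]
events = aggregate_intervals(segment_scores, 0, len(segment_scores))
```

In this toy run, segments 2 to 4 are merged into a single event even though the bisection first split them apart, which is the kind of temporal coherence that independent segment-level predictions lack.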

📝 Abstract

Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we address a key limitation of such pipelines: owing to token constraints, they perform segment-level inference independently and reason without structured temporal context, which yields fragmented predictions and explanations. Our goal is to let VLMs instead interpret anomalies as deviations from evolving video dynamics. Specifically, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism that selects historical content based on frame diversity and visual-textual alignment and supplies it as expanded context, helping generate reliable anomaly scores. Building upon these scores, REA performs recursive temporal aggregation to identify coherent anomaly intervals and produce event-level decisions and explanations grounded in visual-textual evidence. Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN enhances detection accuracy and explanation consistency for frozen VLMs at test time, while generating temporally coherent and semantically grounded event-level explanations.
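One way to picture a selection rule that balances frame diversity against visual-textual alignment is a greedy, maximal-marginal-relevance (MMR) style loop. The sketch below uses that swapped-in technique; it is not the CEA memory mechanism itself. The embeddings, the `trade_off` weight, and the function names `cosine` and `select_memory` are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_memory(frames: List[Sequence[float]], text: Sequence[float],
                  k: int, trade_off: float = 0.5) -> List[int]:
    """Greedily pick up to k frame indices, MMR-style.

    Each step rewards alignment with the text embedding and penalises
    similarity to frames already in memory. Hypothetical stand-in for
    the paper's diversity + alignment criterion.
    """
    selected: List[int] = []
    candidates = list(range(len(frames)))
    while candidates and len(selected) < k:
        def mmr(i: int) -> float:
            relevance = cosine(frames[i], text)
            redundancy = max(
                (cosine(frames[i], frames[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance - (1 - trade_off) * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # restore temporal order


# Toy 2-D embeddings: frames 0 and 1 are near-duplicates.
frame_embs = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
text_emb = [1.0, 1.0]
memory = select_memory(frame_embs, text_emb, k=2)
```

With these toy vectors the loop keeps only one of the two near-duplicate frames and adds the visually distinct one, so the limited context budget is not spent on redundant history.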

Problem

Research questions and friction points this paper is trying to address.

video anomaly detection

vision-language models

temporal context

explainability

test-time inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

context-aware

temporal evidence aggregation

vision-language models