COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the mismatch between training and inference in existing vision-language models (VLMs) for video anomaly detection. The mismatch has two sources: static adaptation strategies that generalize poorly under distribution shift, and the gap between training on sparse frames from long videos and running inference on densely sampled short segments. To resolve this, the authors propose COPRA, a framework built around an input-conditioned parameter generation mechanism, presented as the first of its kind. COPRA uses reinforcement learning to generate task-specific parameters for each video segment while keeping the underlying VLM frozen, so the same dynamic adaptation applies during both training and inference. The method consistently outperforms static baselines in both in-domain and cross-domain anomaly detection, and it also generalizes to unseen downstream tasks such as multiple-choice video question answering and dense video captioning.

📝 Abstract

Vision-language models (VLMs) have shown strong performance in video anomaly detection (VAD) while providing interpretable predictions. However, existing VLM-based VAD methods suffer from a fundamental mismatch between training and inference in both data distribution and model configuration. First, most approaches rely on static post-training adaptation, limiting generalization under distribution shifts such as unseen environments or anomaly types. Second, they train VLMs on sparse frames from long videos, but perform inference on densely sampled short segments, creating inconsistencies between training and testing. To address these limitations, we propose COPRA, a conditional parameter adaptation framework for VLM-based VAD. Instead of fixed prompts or shared parameter updates, COPRA generates input-specific parameter updates to dynamically adapt a frozen VLM for each video segment during both training and inference. Experiments show strong performance on standard VAD benchmarks, consistently outperforming static baselines in both in-domain and cross-domain settings. Moreover, COPRA generalizes beyond VAD to unseen tasks such as multiple-choice Video Question Answering and Dense Captioning. These results highlight COPRA as an effective weight-space generation framework for scalable, adaptive, and context-aware video understanding. The code will be released at https://github.com/THE-MALT-LAB/COPRA
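To make the idea of input-specific parameter updates on a frozen model concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how the updates are parameterized, so the low-rank form `W + B·A`, the linear generator, and every name below (`ConditionalAdapter`, `adapted_forward`, `seg_feature`) are assumptions chosen for illustration. The reinforcement learning objective used to train the generator is omitted entirely.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for pretrained or learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ConditionalAdapter:
    """Hypothetical generator: maps a segment feature to a low-rank update.

    Assumption: the update to a frozen weight W is B @ A, where A and B are
    linear functions of the conditioning vector. Only G_a and G_b would be
    trained; the frozen weight is never modified.
    """

    def __init__(self, d_in, d_out, d_cond, rank, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.G_a = rand_matrix(rank * d_in, d_cond, rng)
        self.G_b = rand_matrix(d_out * rank, d_cond, rng)

    def generate(self, seg_feature):
        """Return (A, B) for one video segment's feature vector."""
        a_flat = matvec(self.G_a, seg_feature)
        b_flat = matvec(self.G_b, seg_feature)
        # Reshape the flat outputs into A (rank x d_in) and B (d_out x rank).
        A = [a_flat[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        B = [b_flat[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return A, B


def adapted_forward(W_frozen, A, B, x):
    """Compute (W + B @ A) @ x without ever writing to W_frozen."""
    base = matvec(W_frozen, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


# Usage: the same frozen layer behaves differently for two segments.
rng = random.Random(1)
W = rand_matrix(4, 6, rng)  # frozen layer weight (toy size)
adapter = ConditionalAdapter(d_in=6, d_out=4, d_cond=3, rank=2)
x = [1.0] * 6               # a token or frame embedding (toy values)

y_seg_a = adapted_forward(W, *adapter.generate([1.0, 0.0, 0.0]), x)
y_seg_b = adapted_forward(W, *adapter.generate([0.0, 1.0, 0.0]), x)
```

Because the update is recomputed from each segment's feature, the same code path can run unchanged at training and inference time, which is the consistency property the abstract emphasizes. A static adapter would correspond to `generate` ignoring its input and returning fixed `A` and `B`.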

Problem

Research questions and friction points this paper is trying to address.

video anomaly detection

vision-language models

distribution shift

training-inference mismatch

model adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional Parameter Adaptation

Vision-Language Models

Video Anomaly Detection