AI Summary
This work addresses the limitations of existing multimodal large language model-based approaches to video anomaly understanding, which often rely on superficial descriptions and lack deep reasoning and self-correction capabilities. To overcome this, the authors propose a reflection-enhanced reasoning framework that introduces the first Reflection-oriented Chain-of-Thought (RoCoT) dataset tailored for video anomaly understanding. They further design a reflection-aware learning paradigm that leverages both supervised fine-tuning and reinforcement fine-tuning to guide the model in performing self-reflection and correction after its initial reasoning. Experimental results demonstrate that the proposed method significantly outperforms current state-of-the-art approaches across multiple video anomaly benchmarks, achieving notable improvements in both temporal localization accuracy and reasoning quality.
Abstract
Multi-modal large language models (MLLMs) have demonstrated significant progress in reasoning capabilities and shown promising effectiveness in video anomaly understanding (VAU) tasks. However, existing MLLM-based approaches remain largely focused on surface-level descriptions of anomalies, lacking deeper reasoning abilities over abnormal behaviors, such as explicit self-reflection and self-correction. To address this, we propose Self-Reflection-Enhanced Reasoning for Video Anomaly Understanding (SRVAU-R1), a reflection-aware learning framework that incorporates reflection into MLLM reasoning. Specifically, SRVAU-R1 introduces the first reflection-oriented Chain-of-Thought dataset tailored for VAU, providing structured supervision with initial reasoning, self-reflection, and revised reasoning. Building on this dataset, it adopts a novel reflection-aware learning paradigm that combines supervised fine-tuning with reinforcement fine-tuning to enhance multi-modal reasoning for VAU. Extensive experiments on multiple video anomaly benchmarks demonstrate that SRVAU-R1 consistently outperforms existing methods, achieving significant improvements in both temporal anomaly localization accuracy and reasoning quality.
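The three-stage structured supervision described in the abstract (initial reasoning, self-reflection, revised reasoning) can be sketched as a simple data record serialized into a fine-tuning target. This is an illustrative sketch only: the field names, `RoCoTSample` class, and tag delimiters below are assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass


@dataclass
class RoCoTSample:
    """One hypothetical reflection-oriented Chain-of-Thought example.

    Field names are illustrative; the actual RoCoT schema may differ.
    """
    video_id: str
    question: str
    initial_reasoning: str   # first-pass analysis of the clip
    self_reflection: str     # critique of the initial reasoning
    revised_reasoning: str   # corrected analysis after reflection
    answer: str              # final anomaly judgment / localization


def to_supervision_target(sample: RoCoTSample) -> str:
    """Serialize the staged reasoning into a single SFT target string."""
    return (
        f"<think>{sample.initial_reasoning}</think>\n"
        f"<reflect>{sample.self_reflection}</reflect>\n"
        f"<revise>{sample.revised_reasoning}</revise>\n"
        f"<answer>{sample.answer}</answer>"
    )
```

Keeping each stage behind an explicit delimiter makes it easy for a reward function during reinforcement fine-tuning to check that the model actually emitted a reflection step, not just a final answer.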