SRVAU-R1: Enhancing Video Anomaly Understanding via Reflection-Aware Learning

πŸ“… 2026-02-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limitations of existing multimodal large language model–based approaches to video anomaly understanding, which often rely on superficial descriptions and lack deep reasoning and self-correction capabilities. To overcome this, the authors propose a reflection-enhanced reasoning framework that introduces the first Reflection-oriented Chain-of-Thought (RoCoT) dataset tailored for video anomaly understanding. They further design a reflection-aware learning paradigm that leverages both supervised fine-tuning and reinforcement fine-tuning to guide the model in performing self-reflection and correction after its initial reasoning. Experimental results demonstrate that the proposed method significantly outperforms current state-of-the-art approaches across multiple video anomaly benchmarks, achieving notable improvements in both temporal localization accuracy and reasoning quality.

πŸ“ Abstract
Multi-modal large language models (MLLMs) have demonstrated significant progress in reasoning capabilities and shown promising effectiveness in video anomaly understanding (VAU) tasks. However, existing MLLM-based approaches remain largely focused on surface-level descriptions of anomalies, lacking deep reasoning over abnormal behaviors such as explicit self-reflection and self-correction. To address this, we propose Self-Reflection-Enhanced Reasoning for Video Anomaly Understanding (SRVAU-R1), a reflection-aware learning framework that incorporates reflection into MLLM reasoning. Specifically, SRVAU-R1 introduces the first reflection-oriented Chain-of-Thought dataset tailored for VAU, providing structured supervision with initial reasoning, self-reflection, and revised reasoning. Building on this dataset, it includes a novel reflection-aware learning paradigm with supervised fine-tuning and reinforcement fine-tuning to enhance multi-modal reasoning for VAU. Extensive experiments on multiple video anomaly benchmarks demonstrate that SRVAU-R1 consistently outperforms existing methods, achieving significant improvements in both temporal anomaly localization accuracy and reasoning quality.
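The structured supervision the abstract describes (initial reasoning, then self-reflection, then revised reasoning) can be sketched as a simple data record. The field names and tag format below are illustrative assumptions for exposition, not the paper's actual schema:

```python
from dataclasses import dataclass

# Hypothetical structure of one reflection-oriented Chain-of-Thought
# training sample, mirroring the three stages the abstract describes.
# Field names are illustrative, not taken from the paper.
@dataclass
class RoCoTSample:
    video_id: str
    question: str
    initial_reasoning: str                 # first-pass analysis of the clip
    self_reflection: str                   # critique of the initial reasoning
    revised_reasoning: str                 # corrected analysis after reflection
    anomaly_interval: tuple[float, float]  # (start_sec, end_sec)

def to_sft_target(sample: RoCoTSample) -> str:
    """Render a sample into a tagged target string for supervised fine-tuning
    (the tag names here are assumed, not the paper's)."""
    return (
        f"<think>{sample.initial_reasoning}</think>\n"
        f"<reflect>{sample.self_reflection}</reflect>\n"
        f"<rethink>{sample.revised_reasoning}</rethink>\n"
        f"<answer>anomaly from {sample.anomaly_interval[0]:.1f}s "
        f"to {sample.anomaly_interval[1]:.1f}s</answer>"
    )
```

A target in this shape would let supervised fine-tuning teach the model to emit a reflection pass after its initial reasoning, before reinforcement fine-tuning further rewards correct revised answers.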
Problem

Research questions and friction points this paper is trying to address:

- video anomaly understanding
- multi-modal large language models
- deep reasoning
- self-reflection
- abnormal behavior
Innovation

Methods, ideas, or system contributions that make the work stand out:

- reflection-aware learning
- video anomaly understanding
- multi-modal large language models
- Chain-of-Thought
- self-reflection
πŸ”Ž Similar Papers
No similar papers found.
Zihao Zhao, Department of Computer Science, The University of Iowa, Iowa City, IA, USA
Shengting Cao, Department of Computer Science, Knox College, Galesburg, IL, USA
Muchao Ye, The University of Iowa
Machine Learning · Artificial Intelligence