AI Summary
This work addresses the limitations of existing multimodal large language model-based approaches to video anomaly understanding, which often rely on superficial descriptions and lack deep reasoning and self-correction capabilities. To overcome this, the authors propose a reflection-enhanced reasoning framework that introduces the first Reflection-oriented Chain-of-Thought (RoCoT) dataset tailored for video anomaly understanding. They further design a reflection-aware learning paradigm that leverages both supervised fine-tuning and reinforcement fine-tuning to guide the model in performing self-reflection and correction after its initial reasoning. Experimental results demonstrate that the proposed method significantly outperforms current state-of-the-art approaches across multiple video anomaly benchmarks, achieving notable improvements in both temporal localization accuracy and reasoning quality.
Abstract
Multi-modal large language models (MLLMs) have demonstrated significant progress in reasoning capabilities and shown promising effectiveness in video anomaly understanding (VAU) tasks. However, existing MLLM-based approaches remain largely focused on surface-level descriptions of anomalies, lacking deeper reasoning abilities over abnormal behaviors, such as explicit self-reflection and self-correction. To address this, we propose Self-Reflection-Enhanced Reasoning for Video Anomaly Understanding (SRVAU-R1), a reflection-aware learning framework that incorporates reflection into MLLM reasoning. Specifically, SRVAU-R1 introduces the first reflection-oriented Chain-of-Thought dataset tailored for VAU, providing structured supervision with initial reasoning, self-reflection, and revised reasoning. Building on this dataset, it adopts a novel reflection-aware learning paradigm that combines supervised fine-tuning with reinforcement fine-tuning to enhance multi-modal reasoning for VAU. Extensive experiments on multiple video anomaly benchmarks demonstrate that SRVAU-R1 consistently outperforms existing methods, achieving significant improvements in both temporal anomaly localization accuracy and reasoning quality.
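The three-stage structured supervision described in the abstract (initial reasoning, self-reflection, revised reasoning) can be sketched as a simple data record serialized into a fine-tuning target. This is an illustrative sketch only: the field names, `RoCoTSample` class, and tag delimiters below are assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass


@dataclass
class RoCoTSample:
    """One hypothetical reflection-oriented Chain-of-Thought example.

    Field names are illustrative; the actual RoCoT schema may differ.
    """
    video_id: str
    question: str
    initial_reasoning: str   # first-pass analysis of the clip
    self_reflection: str     # critique of the initial reasoning
    revised_reasoning: str   # corrected analysis after reflection
    answer: str              # final anomaly judgment / localization


def to_supervision_target(sample: RoCoTSample) -> str:
    """Serialize the staged reasoning into a single SFT target string."""
    return (
        f"<think>{sample.initial_reasoning}</think>\n"
        f"<reflect>{sample.self_reflection}</reflect>\n"
        f"<revise>{sample.revised_reasoning}</revise>\n"
        f"<answer>{sample.answer}</answer>"
    )
```

Keeping each stage behind an explicit delimiter makes it easy for a reward function during reinforcement fine-tuning to check that the model actually emitted a reflection step, not just a final answer.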