CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing video reward models: they struggle to identify and localize fine-grained anomalies, and they lack interpretable spatiotemporal reasoning. The authors propose a coarse-to-fine anomaly reward modeling approach grounded in vision-language models: it first anchors anomalous temporal windows via global scanning, then performs local spatial localization within those windows, and finally integrates a structured spatiotemporal chain-of-thought for robust judgment. Key contributions include the construction of the first large-scale generated-video anomaly dataset annotated with per-frame bounding boxes, temporal anomaly windows, and fine-grained attribution labels; the design of a spatiotemporal IoU-based reward mechanism to guide intermediate localization; and a three-stage progressive training paradigm that combines single- and multi-frame supervised fine-tuning with reinforcement learning driven by accuracy rewards plus Temporal and Spatial IoU rewards. Experiments show a 25.7% accuracy gain on fine-grained anomaly benchmarks. When the model is used as a reward signal, generated-video anomalies drop by 11.7% and overall video quality improves.
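The spatiotemporal IoU-based reward mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(start, end)` and `(x1, y1, x2, y2)` conventions, and the equal weighting of the two terms are all assumptions made for illustration. It only shows the standard interval and box IoU computations that such a reward would plausibly build on.

```python
from typing import Tuple

Interval = Tuple[float, float]            # (start, end), assumed start <= end
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), assumed x1 <= x2, y1 <= y2


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """IoU of two 1-D time windows (seconds or frame indices)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, gt: Box) -> float:
    """IoU of two axis-aligned bounding boxes."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def localization_reward(
    pred_window: Interval,
    gt_window: Interval,
    pred_box: Box,
    gt_box: Box,
    w_temporal: float = 0.5,   # hypothetical weight, not from the paper
    w_spatial: float = 0.5,    # hypothetical weight, not from the paper
) -> float:
    """Weighted sum of temporal and spatial IoU, in [0, 1] when weights sum to 1."""
    return (w_temporal * temporal_iou(pred_window, gt_window)
            + w_spatial * box_iou(pred_box, gt_box))
```

A reward of this shape gives partial credit for near-miss localizations, which is one reason an IoU term can supervise the intermediate steps rather than only the final judgment.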

📝 Abstract

In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated-video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model first learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and is then optimized with a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks. When used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
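For readers unfamiliar with GRPO, the core idea is that each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value network is needed. The sketch below shows only that group-relative normalization step in its commonly described form. It is an assumption-laden illustration, not the paper's two-turn variant: the function name, the use of population standard deviation, and the `eps` constant are choices made here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is used here; some implementations use the sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled judgments for one video, scored by some scalar reward
# (for instance an accuracy term plus localization IoU terms).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one motivation for adding dense intermediate rewards such as IoU terms alongside a binary accuracy reward.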

Problem

Research questions and friction points this paper is trying to address.

video anomaly detection

spatiotemporal reasoning

fine-grained localization

reward modeling

anomaly grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Spatiotemporal Concentrating

Vision-Language Models

Anomaly Reward Model

Temporal and Spatial IoU Rewards

Group Relative Policy Optimization
