CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

📅 2026-05-12
🤖 AI Summary
This work addresses two limitations of existing video reward models: they struggle to identify and localize fine-grained anomalies, and they lack interpretable spatiotemporal reasoning. The authors propose a coarse-to-fine anomaly reward model built on vision-language models: it first anchors anomalous temporal windows via a global scan, then performs spatial localization within the anchored window, and finally reaches a judgment through a structured spatiotemporal chain-of-thought. Key contributions include the first large-scale generated-video anomaly dataset annotated with per-frame bounding boxes, temporal anomaly windows, and fine-grained attribution labels; Temporal and Spatial IoU rewards that supervise the intermediate localization steps; and a three-stage progressive training paradigm that combines supervised fine-tuning with reinforcement learning driven by both accuracy and IoU rewards. Experiments show a 25.7% accuracy gain on fine-grained anomaly benchmarks and, when the model is used as a reward signal, an 11.7% reduction in generated-video anomalies together with improved overall video quality.
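To make the IoU-based rewards concrete, here is a minimal sketch of temporal and spatial IoU between a predicted and an annotated localization. The summary does not give the paper's exact formulation, so the `(start, end)` window format, the `(x1, y1, x2, y2)` box format, and the function names are all assumptions made for illustration.

```python
from typing import Tuple

# Assumed formats (not specified in the source):
Window = Tuple[float, float]              # (start, end), in seconds or frame indices
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2


def temporal_iou(pred: Window, gt: Window) -> float:
    """Overlap of two time windows divided by the length of their union."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def spatial_iou(pred: Box, gt: Box) -> float:
    """Overlap area of two axis-aligned boxes divided by the area of their union."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0
```

For example, a predicted window `(0, 4)` against an annotated window `(2, 6)` overlaps for 2 units out of a union of 6, so the temporal IoU is 1/3. A reward of this shape grows as the predicted localization approaches the annotation, which is what allows it to supervise the intermediate steps rather than only the final verdict.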
📝 Abstract
In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model first learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and is then optimized with a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks. When used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
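The GRPO stage can be illustrated with a small sketch of how an accuracy reward and the two IoU rewards might be combined and then turned into group-relative advantages. The abstract does not state how the rewards are weighted or how the two-turn rollout is structured, so the weighted sum, the weights `w_acc`, `w_t`, `w_s`, and the function names below are hypothetical. The normalization step follows the commonly described GRPO recipe of standardizing rewards within a group of responses sampled for the same prompt.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(correct: bool, t_iou: float, s_iou: float,
                    w_acc: float = 1.0, w_t: float = 0.5, w_s: float = 0.5) -> float:
    """Hypothetical weighted sum of accuracy, temporal IoU, and spatial IoU rewards."""
    return w_acc * float(correct) + w_t * t_iou + w_s * s_iou


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one video: (verdict correct?, temporal IoU, spatial IoU)
group = [(True, 0.8, 0.6), (True, 0.2, 0.1), (False, 0.5, 0.4), (False, 0.0, 0.0)]
rewards = [combined_reward(c, t, s) for c, t, s in group]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, two responses with the same correct verdict still receive different advantages when one localizes the anomaly better, which is the intuition behind supervising the intermediate localization and not just the final answer.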
Problem

Research questions and friction points this paper is trying to address.

video anomaly detection
spatiotemporal reasoning
fine-grained localization
reward modeling
anomaly grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Spatiotemporal Concentrating
Vision-Language Models
Anomaly Reward Model
Temporal and Spatial IoU Rewards
Group Relative Policy Optimization