MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing industrial anomaly detection datasets rely mainly on static images or sparse viewpoints, so they do not fully reflect the continuous, multi-view inspection carried out on real production lines. To address this gap, the authors introduce MMVIAD, which they describe as, to the best of their knowledge, the first continuous multi-view video dataset for industrial anomaly detection, together with a benchmark covering four tasks: anomaly detection, defect classification, object classification, and anomaly visible-time localization. They also propose a two-stage post-training pipeline: Perception-Structured Supervised Fine-Tuning (PS-SFT) followed by VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization), which uses a semantic-gated defect reward and a visibility-aware temporal reward. The resulting model, VISTA, raises the base model's average score across the four tasks on MMVIAD-Unseen from 45.0 to 57.5, surpassing GPT-5.4. The evaluations also show that current commercial and open-source video MLLMs remain far below human performance, especially on fine-grained defect recognition and temporal grounding.

📝 Abstract

Industrial anomaly detection is critical for manufacturing quality control, yet existing datasets mainly focus on static images or sparse views, which do not fully reflect continuous inspection processes in real industrial scenarios. We introduce MMVIAD (Multi-view Multi-task Video Industrial Anomaly Detection), to the best of our knowledge the first continuous multi-view video dataset for industrial anomaly detection and understanding, together with a benchmark for multi-task evaluation. MMVIAD contains object-centric 2-second inspection clips with approximately 120 degrees of camera motion, covering 48 object categories, 14 environments, and 6 structural anomaly types. It supports anomaly detection, defect classification, object classification, and anomaly visible-time localization. Systematic evaluations on MMVIAD show that current commercial and open-source video MLLMs remain far below human performance, especially for fine-grained defect recognition and temporal grounding. To improve transferable anomaly understanding, we further develop a two-stage post-training pipeline where PS-SFT (Perception-Structured Supervised Fine-Tuning) initializes perception-structured reasoning and VISTA-GRPO (Visibility-grounded Industrial Structured Temporal Anomaly Group Relative Policy Optimization) refines the model with semantic-gated defect reward and visibility-aware temporal reward, producing the final model VISTA. On MMVIAD-Unseen, VISTA improves the base model's average score across the four tasks from 45.0 to 57.5, surpassing GPT-5.4. Source code is available at https://github.com/Georgekeepmoving/MMVIAD.
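
As a rough illustration of the two ideas the abstract names (temporal localization of the anomaly's visible interval, and group-relative policy optimization), the sketch below scores predicted time intervals with temporal IoU and then normalizes rewards within a sampled group, as GRPO-style methods commonly do. The abstract does not give the actual reward formulas, so temporal IoU is used here only as an assumed stand-in for the visibility-aware temporal reward, and the function names `temporal_iou` and `group_relative_advantages` are hypothetical, not taken from the MMVIAD code.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """Intersection-over-union of two time intervals (start, end) in seconds.

    Assumption: used as a stand-in reward; the paper's real temporal
    reward is not specified in the abstract.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps),
    the usual group-relative normalization.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: three sampled predictions for one 2-second clip
# whose anomaly is visible from 1.0 s to 2.0 s.
ground_truth = (1.0, 2.0)
predictions = [(0.0, 0.5), (0.5, 1.5), (1.0, 1.8)]
rewards = [temporal_iou(p, ground_truth) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Predictions that overlap the ground-truth interval more receive higher rewards and therefore positive advantages relative to the rest of the group, while a prediction with no overlap receives a negative advantage.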

Problem

Research questions and friction points this paper is trying to address.

industrial anomaly detection

multi-view video

continuous inspection

defect recognition

temporal grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view video

industrial anomaly detection

multi-task learning