SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Large Language Models

📅 2025-06-15
🤖 AI Summary
Problem: Existing video anomaly detection (VAD) benchmarks are poorly suited to smart home environments, which hinders evaluation of models on realistic domestic anomalies. Method: We introduce SmartHome-Bench, the first multimodal VAD benchmark tailored to smart homes, comprising 1,203 videos recorded by smart home cameras and annotated with anomaly tags, detailed descriptions, and reasoning, organized under a seven-category anomaly taxonomy (e.g., Wildlife, Senior Care, Baby Monitoring). We also propose the Taxonomy-Driven Reflective LLM Chain (TRLC), an LLM chaining framework for improving detection accuracy. Contribution/Results: Our work systematically evaluates state-of-the-art closed- and open-source multimodal large language models (MLLMs) on smart home VAD under various prompting techniques and finds significant limitations in their ability to detect anomalies accurately. TRLC achieves an 11.62% improvement in detection accuracy on SmartHome-Bench. The dataset and code are publicly available.

📝 Abstract
Video anomaly detection (VAD) is essential for enhancing safety and security by identifying unusual events across different environments. Existing VAD benchmarks, however, are primarily designed for general-purpose scenarios, neglecting the specific characteristics of smart home applications. To bridge this gap, we introduce SmartHome-Bench, the first comprehensive benchmark specially designed for evaluating VAD in smart home scenarios, focusing on the capabilities of multi-modal large language models (MLLMs). Our newly proposed benchmark consists of 1,203 videos recorded by smart home cameras, organized according to a novel anomaly taxonomy that includes seven categories, such as Wildlife, Senior Care, and Baby Monitoring. Each video is meticulously annotated with anomaly tags, detailed descriptions, and reasoning. We further investigate adaptation methods for MLLMs in VAD, assessing state-of-the-art closed-source and open-source models with various prompting techniques. Results reveal significant limitations in the current models' ability to detect video anomalies accurately. To address these limitations, we introduce the Taxonomy-Driven Reflective LLM Chain (TRLC), a new LLM chaining framework that achieves a notable 11.62% improvement in detection accuracy. The benchmark dataset and code are publicly available at https://github.com/Xinyi-0724/SmartHome-Bench-LLM.
Problem

Research questions and friction points this paper is trying to address.

Lack of specialized benchmarks for smart home video anomaly detection
Current models struggle with accurate video anomaly detection
Need for improved methods to enhance detection accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for smart home VAD
Uses multi-modal large language models
Introduces Taxonomy-Driven Reflective LLM Chain
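
The abstract names TRLC but does not spell out its steps, so the following is only a minimal sketch of what a taxonomy-informed, two-pass (answer, then reflect) chain could look like. It is not the authors' implementation. The prompt wording, the function names, and the `llm` callable (any function mapping a prompt string to a reply string) are all assumptions made for illustration; only the three category names come from the abstract.

```python
from typing import Callable, Sequence

# Category names quoted in the abstract; the other four are not listed there.
TAXONOMY = ["Wildlife", "Senior Care", "Baby Monitoring"]


def build_prompt(description: str) -> str:
    """First pass: ask for a verdict with the taxonomy as context (hypothetical wording)."""
    categories = ", ".join(TAXONOMY)
    return (
        f"Anomaly categories: {categories}.\n"
        f"Video description: {description}\n"
        "Answer 'anomaly' or 'normal', then explain briefly."
    )


def build_reflection_prompt(description: str, first_answer: str) -> str:
    """Second pass: ask the model to re-check its own answer (hypothetical wording)."""
    return (
        f"Video description: {description}\n"
        f"Previous answer: {first_answer}\n"
        "Re-check the previous answer against the anomaly categories. "
        "Reply with a final 'anomaly' or 'normal', then explain briefly."
    )


def parse_label(answer: str) -> bool:
    """Return True when the reply starts with 'anomaly' (case-insensitive)."""
    return answer.strip().lower().startswith("anomaly")


def reflective_chain(description: str, llm: Callable[[str], str]) -> bool:
    """Run the two-pass chain and return the final anomaly verdict."""
    first = llm(build_prompt(description))
    revised = llm(build_reflection_prompt(description, first))
    return parse_label(revised)


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

Passing the model in as a plain callable keeps the sketch independent of any particular MLLM API; a real setup would feed video frames or captions to the model instead of a text description.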
Authors

Xinyi Zhao, Columbia University (Data Science, Data Visualization)
Congjing Zhang, University of Washington
Pei Guo, Soochow University (LLMs, Natural Language Generation)
Wei Li, Wyze Labs, Inc.
Lin Chen, Wyze Labs, Inc.
Chaoyue Zhao, University of Washington
Shuai Huang, University of Washington