Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing text-based person search benchmarks predominantly focus on routine actions and lack explicit modeling of anomalous behaviors. To address this limitation, this work introduces the novel task of *text-based person anomaly search*, which aims to precisely locate individuals via natural language queries, whether they exhibit normal or abnormal behaviors (e.g., falling or being knocked down). To support this task, we construct PAB, the first large-scale multimodal benchmark for Pedestrian Anomaly Behavior, comprising 1,013,605 synthetic training and 1,978 real-world test image-text pairs. We further propose a cross-modal pose-aware retrieval framework integrated with identity-based hard negative pair sampling. On the PAB test set, our method achieves a Recall@1 of 84.93%, substantially outperforming baseline approaches. The dataset, models, and code are fully open-sourced.
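
The identity-based hard negative sampling named above pairs a query with descriptions of the same person performing a different action, so the model must discriminate behavior rather than identity cues alone. The paper's exact sampler is not reproduced here; the following is a minimal Python sketch under the assumption that each training pair carries hypothetical `image`, `caption`, and `identity` fields:

```python
import random
from collections import defaultdict

def sample_identity_hard_negatives(pairs, num_negatives=1, seed=0):
    """For each anchor pair, draw captions of the SAME identity but a
    DIFFERENT behavior as hard negatives (illustrative sketch only)."""
    rng = random.Random(seed)
    by_identity = defaultdict(list)
    for p in pairs:
        by_identity[p["identity"]].append(p)

    samples = []
    for anchor in pairs:
        # Same person, different description => a "hard" negative that
        # cannot be solved by identity cues alone.
        candidates = [q for q in by_identity[anchor["identity"]]
                      if q["caption"] != anchor["caption"]]
        if not candidates:
            continue  # identity appears with a single behavior only
        negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
        samples.append({"anchor": anchor,
                        "hard_negatives": [n["caption"] for n in negatives]})
    return samples
```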

📝 Abstract
Text-based person search aims to retrieve specific individuals across camera networks using natural language descriptions. However, current benchmarks often exhibit biases towards common actions like walking or standing, neglecting the critical need for identifying abnormal behaviors in real-world scenarios. To meet such demands, we propose a new task, text-based person anomaly search, which locates pedestrians engaged in both routine and anomalous activities via text. To enable training and evaluation for this new task, we construct a large-scale image-text Pedestrian Anomaly Behavior (PAB) benchmark, featuring a broad spectrum of actions, e.g., running, performing, and playing soccer, and the corresponding anomalies of the same identity, e.g., lying, being hit, and falling. The training set of PAB comprises 1,013,605 synthesized image-text pairs covering both normalities and anomalies, while the test set includes 1,978 real-world image-text pairs. To validate the potential of PAB, we introduce a cross-modal pose-aware framework, which integrates human pose patterns with identity-based hard negative pair sampling. Extensive experiments on the proposed benchmark show that synthetic training data facilitates fine-grained behavior retrieval, and the proposed pose-aware method achieves 84.93% recall@1 accuracy, surpassing other competitive methods. The dataset, model, and code are available at https://github.com/Shuyu-XJTU/CMP.
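
For reference, recall@1 is the fraction of text queries whose top-ranked gallery image is the correct match. A self-contained sketch of the metric (not the authors' evaluation code; the `similarity` matrix and variable names are assumptions):

```python
import numpy as np

def recall_at_k(similarity, gt_index, ks=(1, 5, 10)):
    """Text-to-image retrieval Recall@K.

    similarity: (num_queries, num_gallery) scores, higher is better.
    gt_index:   (num_queries,) index of the correct gallery image.
    """
    order = np.argsort(-similarity, axis=1)        # best match first
    hits = order == np.asarray(gt_index)[:, None]  # per-rank hit flags
    return {f"R@{k}": float(hits[:, :k].any(axis=1).mean() * 100) for k in ks}

# Toy usage: 4 queries against a 10-image gallery with random scores.
sim = np.random.default_rng(0).standard_normal((4, 10))
print(recall_at_k(sim, gt_index=[3, 1, 7, 0]))
```

Under standard retrieval evaluation, the reported 84.93% would be the R@1 value computed this way over the 1,978 real-world test queries.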
Problem

Research questions and friction points this paper is trying to address.

Identifying abnormal pedestrian behaviors via text descriptions
Overcoming biases in current benchmarks for person search
Enhancing retrieval of fine-grained actions and anomalies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale image-text Pedestrian Anomaly Behavior benchmark
Cross-modal pose-aware framework integration (see the fusion sketch after this list)
Synthetic training data for fine-grained retrieval
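
The CMP architecture itself is not reproduced here; as one plausible reading of "pose-aware", the toy PyTorch sketch below cross-attends visual tokens to projected body keypoints so that posture (e.g., lying vs. standing) is injected into the retrieval features. All dimensions, module names, and the keypoint format are assumptions, not the authors' design:

```python
import torch
import torch.nn as nn

class PoseAwareFusion(nn.Module):
    """Toy fusion block: visual tokens attend to a pose token derived
    from 2D keypoints (hypothetical wiring, not the CMP model)."""

    def __init__(self, dim=256, num_heads=4, num_keypoints=17):
        super().__init__()
        self.pose_proj = nn.Linear(num_keypoints * 2, dim)  # (x, y) per joint
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, keypoints):
        # visual_tokens: (B, N, dim); keypoints: (B, num_keypoints, 2)
        pose_token = self.pose_proj(keypoints.flatten(1)).unsqueeze(1)  # (B, 1, dim)
        fused, _ = self.cross_attn(visual_tokens, pose_token, pose_token)
        return self.norm(visual_tokens + fused)  # residual pose injection

# Toy usage: a batch of 2 images with 49 visual tokens each.
block = PoseAwareFusion()
out = block(torch.randn(2, 49, 256), torch.randn(2, 17, 2))
print(out.shape)  # torch.Size([2, 49, 256])
```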