Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search

📅 2026-04-25
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the ambiguity in text-based person anomaly search caused by the mismatch between pose and semantics: semantically distinct behaviors can exhibit similar skeletal structures. To resolve this, the authors propose the Structure-Semantic Decoupled Cascade (SSDC) framework, a two-stage mechanism. First, a lightweight pose-aware model efficiently retrieves candidates based on skeletal structure. Then a multi-agent interaction module composed of Detective, Analyst, and Writer agents uses multimodal large language models to verify the candidates semantically and re-rank them. The approach decouples structural matching from semantic reasoning while keeping the two stages cooperative, and it achieves state-of-the-art performance on the PAB benchmark while balancing large-scale retrieval efficiency against complex semantic inference.

📝 Abstract
Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.
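
The two-stage cascade described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Candidate`, `coarse_retrieve`, `squad_rerank`, and `demo`, the cosine-similarity scoring, the weighted-sum fusion with `alpha`, the choice to drop candidates the Detective rejects, and the toy stand-ins for the three agents are all assumptions made for illustration. In SSDC the agents are backed by MLLMs and the structural scores come from a pose-aware model.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Candidate:
    item_id: str
    pose_vec: Sequence[float]      # hypothetical pose-aware embedding
    structural_score: float = 0.0  # filled in by stage 1
    caption: str = ""              # filled in by the Writer in stage 2
    final_score: float = 0.0       # fused score used for the final ranking


def coarse_retrieve(query_vec, gallery: List[Candidate], top_k: int) -> List[Candidate]:
    """Stage 1 (assumed form): keep the top_k items by structural similarity."""
    for c in gallery:
        c.structural_score = cosine(query_vec, c.pose_vec)
    return sorted(gallery, key=lambda c: c.structural_score, reverse=True)[:top_k]


def squad_rerank(query: str, candidates: List[Candidate],
                 detective: Callable, analyst: Callable, writer: Callable,
                 text_sim: Callable[[str, str], float],
                 alpha: float = 0.5) -> List[Candidate]:
    """Stage 2 (assumed form): filter, describe, then fuse caption and structure."""
    kept = []
    for c in candidates:
        if not detective(query, c):        # fast binary filter
            continue                       # assumption: rejected items are dropped
        evidence = analyst(query, c)       # evidence extraction
        c.caption = writer(query, evidence)  # semantic synthesis
        # Assumption: fusion is a weighted sum of structural prior and text match.
        c.final_score = (alpha * c.structural_score
                         + (1 - alpha) * text_sim(query, c.caption))
        kept.append(c)
    return sorted(kept, key=lambda c: c.final_score, reverse=True)


def jaccard(a: str, b: str) -> float:
    """Toy text similarity: token-set overlap, standing in for a text encoder."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def demo() -> List[str]:
    """Toy run: 'a' and 'b' have near-identical poses but different actions."""
    actions = {"a": "lying", "b": "falling", "c": "running"}  # made-up labels
    gallery = [Candidate("a", [1.0, 0.0]),
               Candidate("b", [0.9, 0.1]),
               Candidate("c", [0.0, 1.0])]
    query = "a person falling on the ground"
    shortlist = coarse_retrieve([1.0, 0.0], gallery, top_k=2)
    ranked = squad_rerank(
        query, shortlist,
        detective=lambda q, c: True,                     # stub: accept everything
        analyst=lambda q, c: actions[c.item_id],         # stub: look up the action
        writer=lambda q, ev: f"a person {ev} on the ground",
        text_sim=jaccard,
    )
    return [c.item_id for c in ranked]


if __name__ == "__main__":
    print(demo())
```

In this toy run, item `a` wins the structure-only stage, but item `b` moves ahead once the synthesized caption is fused in, which is the kind of pose-semantic ambiguity the cascade is meant to resolve. The expensive agents only ever see the `top_k` shortlist, which is where the efficiency of the cascade comes from.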
Problem

Research questions and friction points this paper is trying to address.

Pose-Semantic Gap
Text-Based Person Anomaly Search
Multimodal Large Language Models
Behavioral Retrieval
Surveillance Archives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pose-Semantic Gap
Cascade Retrieval
Multimodal Large Language Models
Multi-Agent Verification
Text-Based Person Search