ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

📅 2026-04-14

🤖 AI Summary

This work addresses person retrieval across multiple cameras from vague eyewitness descriptions. It is the first to formulate the problem as an interactive reasoning task in which an agent must plan, question, and eliminate candidates under information asymmetry and a limited turn budget. A large language model serves as the core reasoning agent, grounded in a Spatio-Temporal Topology Graph (STTG) that encodes camera connectivity and empirically validated transition times, and equipped with domain-specific tools for coordinated semantic, spatial, and temporal reasoning. The benchmark comprises 2,691 tasks across 14 real-world scenarios. Across four LLM backbones, the best TWS scores are 0.383 on Track 2 and 0.590 on Track 3, and ablations show that removing domain-specific tools reduces accuracy by up to 49.6 percentage points.

📝 Abstract

We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
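
To make the role of the STTG concrete, the following is a minimal sketch of how a graph of camera connectivity and transition times could be used to eliminate candidates. It is an illustration under stated assumptions, not the ARGOS implementation: the class name `STTG`, the methods `add_transition` and `is_feasible`, the helper `eliminate`, the camera names, and the (min, max) seconds representation of a transition time are all hypothetical. The sketch also considers direct camera-to-camera edges only.

```python
from dataclasses import dataclass, field


@dataclass
class STTG:
    """Hypothetical spatio-temporal topology graph (names are assumptions)."""
    # edges[(src, dst)] = (min_seconds, max_seconds) for moving from src to dst
    edges: dict = field(default_factory=dict)

    def add_transition(self, src, dst, min_s, max_s):
        # Record a directed connection with its plausible travel-time window.
        self.edges[(src, dst)] = (min_s, max_s)

    def is_feasible(self, src, t_src, dst, t_dst):
        # A sighting at dst is feasible only if the cameras are directly
        # connected and the elapsed time falls inside the recorded window.
        window = self.edges.get((src, dst))
        if window is None:
            return False
        lo, hi = window
        return lo <= (t_dst - t_src) <= hi


def eliminate(graph, anchor, candidates):
    """Keep candidates whose sighting is reachable from the anchor sighting."""
    cam0, t0 = anchor
    return [c for c in candidates
            if graph.is_feasible(cam0, t0, c["camera"], c["time"])]


# Illustrative data only; times are seconds from an arbitrary origin.
graph = STTG()
graph.add_transition("cam_A", "cam_B", 20, 60)
graph.add_transition("cam_A", "cam_C", 90, 150)

candidates = [
    {"id": "p1", "camera": "cam_B", "time": 130},  # 30 s after anchor
    {"id": "p2", "camera": "cam_B", "time": 300},  # 200 s after anchor
    {"id": "p3", "camera": "cam_C", "time": 200},  # 100 s after anchor
    {"id": "p4", "camera": "cam_D", "time": 120},  # no edge from cam_A
]

survivors = eliminate(graph, ("cam_A", 100), candidates)
print([c["id"] for c in survivors])
```

In this toy setup, a witness-confirmed sighting at `cam_A` acts as the anchor, and any candidate whose camera is unreachable or whose timing falls outside the transition window is dropped. An agent in the Where and When tracks would presumably combine this kind of pruning with questions to the witness, but that interaction loop is outside the scope of the sketch.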

Problem

Research questions and friction points this paper is trying to address.

multi-camera person search

interactive reasoning

information asymmetry

spatio-temporal reasoning

agent-based search

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic reasoning

multi-camera person search

spatio-temporal topology graph