UrbanClipAtlas: A Visual Analytics Framework for Event and Scene Retrieval in Urban Videos

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work targets the labor-intensive manual inspection of long urban videos recorded at street intersections. It presents a visual analytics system that segments long videos into short clips, generates a textual description of each clip with a vision-language model, and indexes the descriptions for semantic retrieval. The system combines retrieval-augmented generation (RAG), a knowledge graph built on a domain-specific taxonomy, and alignment of extracted entities with detected objects and trajectories, so that analysts can retrieve events semantically and verify answers against the video. Pairing taxonomy-aware entity extraction with video grounding ties textual reasoning more closely to visual evidence. Two case studies on the StreetAware dataset, covering hazardous scenarios and crossing dynamics, illustrate how the system reduces the effort needed to validate model outputs and refine hypotheses.
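The segment-describe-index-retrieve flow summarized above can be sketched in a few lines. This is a minimal illustration, not the system's implementation: the clip descriptions are hard-coded placeholders standing in for vision-language model output, the bag-of-words cosine score stands in for whatever semantic index the system actually uses, and the names `segment`, `build_index`, and `search` are hypothetical.

```python
import math
import re
from collections import Counter


def segment(duration_s, clip_len_s):
    """Split [0, duration_s) into consecutive (start, end) clip windows."""
    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        clips.append((start, end))
        start = end
    return clips


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(descriptions):
    """One token-count vector per clip description (assumption: toy index)."""
    return [Counter(tokenize(d)) for d in descriptions]


def search(index, query, top_k=1):
    """Return indices of the top_k clips whose descriptions best match the query."""
    q = Counter(tokenize(query))
    scored = sorted(((cosine(q, vec), i) for i, vec in enumerate(index)), reverse=True)
    return [i for score, i in scored[:top_k] if score > 0]


# A 25-second recording cut into 10-second clips (the last clip is shorter).
clips = segment(duration_s=25.0, clip_len_s=10.0)

# Placeholder descriptions, one per clip; a real pipeline would get these from a VLM.
descriptions = [
    "a cyclist rides through the intersection",
    "a pedestrian crosses the street against the signal",
    "a bus waits at the red light",
]
index = build_index(descriptions)
hits = search(index, "pedestrian crossing the street")
```

Here `hits` holds clip indices, so `clips[hits[0]]` gives the time window an analyst would jump to for visual verification.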

📝 Abstract
Extracting actionable insights from long-duration urban videos is often labor-intensive: analysts must manually sift through raw footage to pinpoint target events or uncover broader behavioral trends. In this work, we present UrbanClipAtlas, a visual analytics system for exploring long urban videos recorded at street intersections. UrbanClipAtlas combines retrieval-augmented generation (RAG), taxonomy-aware entity extraction, and video grounding to support event retrieval and interpretation. The system segments extended recordings into short clips, generates textual descriptions with a vision-language model, and indexes them for semantic retrieval. A knowledge graph maps entities and relations from LLM answers onto a domain-specific taxonomy and aligns them with detected objects and trajectories to support visual grounding and verification. UrbanClipAtlas supports scene retrieval through an augmented chat-based interface and improves scene interpretation by tightly aligning textual outputs with video evidence. This design strengthens the connection between textual reasoning and visual evidence, reducing the effort required to validate model outputs and refine hypotheses. We demonstrate the usefulness of UrbanClipAtlas on the StreetAware dataset through two case studies involving hazardous scenarios and crossing dynamics at street intersections. UrbanClipAtlas helps analysts reason about safety- and mobility-related patterns across large urban video collections.
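The knowledge-graph step, in which entities from an LLM answer are mapped onto a taxonomy and aligned with detected objects, can be illustrated with a toy example. Everything here is assumed for illustration: the `TAXONOMY` entries, the track records, and the helper names `extract_entities` and `ground` are hypothetical, and plain keyword lookup stands in for the paper's taxonomy-aware extraction, which the abstract does not specify.

```python
# Hypothetical taxonomy: surface terms and detector labels map to taxonomy nodes.
TAXONOMY = {
    "pedestrian": "road_user/pedestrian",
    "person": "road_user/pedestrian",
    "cyclist": "road_user/cyclist",
    "car": "vehicle/car",
    "bus": "vehicle/bus",
}


def extract_entities(answer):
    """Return taxonomy nodes mentioned in the answer, in order of first mention."""
    found = []
    words = answer.lower().replace(",", " ").replace(".", " ").split()
    for word in words:
        node = TAXONOMY.get(word)
        if node and node not in found:
            found.append(node)
    return found


def ground(entities, tracks):
    """Map each taxonomy node to the ids of tracks whose label maps to that node."""
    grounded = {}
    for node in entities:
        grounded[node] = [
            t["track_id"] for t in tracks if TAXONOMY.get(t["label"]) == node
        ]
    return grounded


# Placeholder LLM answer and object tracks for one clip (assumed data).
answer = "A person steps off the curb while a car turns right."
tracks = [
    {"track_id": 7, "label": "pedestrian"},
    {"track_id": 12, "label": "car"},
    {"track_id": 15, "label": "bus"},
]

entities = extract_entities(answer)
grounded = ground(entities, tracks)
```

Because "person" in the answer and "pedestrian" from the detector resolve to the same taxonomy node, the answer's entity can be linked to track 7, which is the kind of text-to-video link an analyst would use to check the answer against the footage. An entity with an empty track list would flag a claim with no visual support in the clip.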
Problem

Research questions and friction points this paper is trying to address.

urban video analysis
event retrieval
scene interpretation
visual analytics
long-duration video
Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-augmented generation
taxonomy-aware entity extraction
video grounding
visual analytics
vision-language model