UrbanClipAtlas: A Visual Analytics Framework for Event and Scene Retrieval in Urban Videos

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of manually inspecting urban surveillance video. It proposes a visual analytics system that segments long videos into short clips and uses a vision-language model to generate semantic descriptions for indexing. The system integrates retrieval-augmented generation (RAG), a domain-specific knowledge graph, and video-entity alignment to support semantic event retrieval and visual verification. It combines taxonomy-aware entity extraction with video grounding to keep textual reasoning consistent with visual evidence. Two case studies on the StreetAware dataset, covering hazardous scenarios and crossing dynamics at street intersections, illustrate how the system reduces the effort needed to validate model outputs and refine hypotheses.

📝 Abstract

Extracting actionable insights from long-duration urban videos is often labor-intensive: analysts must manually sift through raw footage to pinpoint target events or uncover broader behavioral trends. In this work, we present URBANCLIPATLAS, a visual analytics system for exploring long urban videos recorded at street intersections. URBANCLIPATLAS combines retrieval-augmented generation (RAG), taxonomy-aware entity extraction, and video grounding to support event retrieval and interpretation. The system segments extended recordings into short clips, generates textual descriptions with a vision-language model, and indexes them for semantic retrieval. A knowledge graph maps entities and relations from LLM answers onto a domain-specific taxonomy and aligns them with detected objects and trajectories to support visual grounding and verification. URBANCLIPATLAS supports scene retrieval through an augmented chat-based interface and improves scene interpretation by tightly aligning textual outputs with video evidence. This design strengthens the connection between textual reasoning and visual evidence, reducing the effort required to validate model outputs and refine hypotheses. We demonstrate the usefulness of URBANCLIPATLAS on the StreetAware dataset through two case studies involving hazardous scenarios and crossing dynamics at street intersections. URBANCLIPATLAS helps analysts reason about safety- and mobility-related patterns across large urban video collections.
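The segment, describe, and index steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed 10-second clip length, the hand-written captions standing in for vision-language model output, and the bag-of-words cosine similarity standing in for a learned embedding index are all assumptions made for illustration, as are the names `Clip`, `segment`, and `retrieve`.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Clip:
    start_s: float
    end_s: float
    description: str = ""  # would come from a vision-language model


def segment(duration_s: float, clip_len_s: float = 10.0) -> list[Clip]:
    """Split a recording of duration_s seconds into consecutive clips."""
    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        clips.append(Clip(start, end))
        start = end
    return clips


def tokens(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a text embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, clips: list[Clip], k: int = 3) -> list[Clip]:
    """Return the k clips whose descriptions best match the query."""
    q = tokens(query)
    ranked = sorted(
        clips, key=lambda c: cosine(q, tokens(c.description)), reverse=True
    )
    return ranked[:k]


# A 25-second recording yields three clips: 0-10 s, 10-20 s, 20-25 s.
clips = segment(25.0, clip_len_s=10.0)

# Hypothetical captions; a real system would generate these per clip.
captions = [
    "a cyclist waits at the red light",
    "a pedestrian crosses the street while a car turns right",
    "an empty intersection at night",
]
for clip, caption in zip(clips, captions):
    clip.description = caption

top = retrieve("pedestrian crossing near a turning car", clips, k=1)
print(top[0].start_s, top[0].end_s, top[0].description)
```

In a RAG setting, the retrieved clip descriptions would be passed to a language model as context for answering the analyst's question, and the clip timestamps would let the interface jump to the matching footage.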

Problem

Research questions and friction points this paper is trying to address.

urban video analysis

event retrieval

scene interpretation

visual analytics

long-duration video

Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval-augmented generation

taxonomy-aware entity extraction

video grounding