UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces

📅 2025-03-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study evaluates Video-LLMs' capabilities in memory, perception, reasoning, and navigation over continuous first-person video for urban embodied intelligence. To this end, the authors introduce UrbanVideo-Bench, the first embodied video-language benchmark for open urban 3D environments. It integrates video data from two sources, drone-captured real-world footage and photorealistic simulation, annotated via human labeling and automated pipelines to yield 5.2K multiple-choice questions. Seventeen state-of-the-art Video-LLMs are systematically evaluated under this city-scale embodied video framework. The analysis reveals, for the first time, a strong correlation between causal reasoning ability and multi-task performance, and empirically validates sim-to-real transfer for embodied video understanding. Results show that current models have substantial limitations in urban embodied cognition, with causal reasoning as the primary bottleneck; critically, performance improves significantly after simulation-based pretraining followed by real-world fine-tuning.
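At its core, the evaluation is standard multiple-choice scoring over video inputs: each model receives a clip plus a question with lettered options, and its reply is matched against the ground-truth letter. Below is a minimal sketch of such a scorer; the `generate` callable, the question-dict fields, and the answer-extraction regex are illustrative assumptions, not the paper's released code.

```python
import re

def score_model(generate, questions):
    """Accuracy of a video-QA callable: generate(video, prompt) -> str."""
    correct = 0
    for q in questions:
        prompt = (
            q["question"] + "\n"
            + "\n".join(f"{k}. {v}" for k, v in q["options"].items())
            + "\nAnswer with a single letter."
        )
        reply = generate(q["video"], prompt)
        m = re.search(r"\b([A-D])\b", reply)  # first standalone option letter
        correct += bool(m and m.group(1) == q["answer"])
    return correct / len(questions)

# Toy usage: a dummy "model" that always answers A.
qs = [{
    "question": "Which way did the drone turn at the intersection?",
    "video": "clip_001.mp4",
    "options": {"A": "Left", "B": "Right", "C": "It flew straight", "D": "Unclear"},
    "answer": "A",
}]
print(score_model(lambda video, prompt: "The answer is A.", qs))  # -> 1.0
```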

📝 Abstract
Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban 3D space remain to be explored. We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We manually controlled drones to collect 3D embodied motion video data from real-world cities and simulated environments, resulting in 1.5k video clips. We then design a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely-used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between different tasks, showing that causal reasoning correlates strongly with recall, perception, and navigation, while counterfactual and associative reasoning correlate less with the other tasks. We also validate the potential for Sim-to-Real transfer in urban embodiment through fine-tuning.
Problem

Research questions and friction points this paper is trying to address.

Evaluate Video-LLMs' ability to process first-person urban video data.
Assess embodied cognition in urban spaces using drone-collected video clips.
Explore correlations between reasoning, recall, perception, and navigation tasks, as sketched below.
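The third question can be made concrete: collect each model's accuracy on every task, then correlate the task-accuracy vectors across models. A minimal sketch follows; the accuracy numbers are placeholders for illustration, not results from the paper.

```python
import numpy as np

tasks = ["recall", "perception", "causal_reasoning", "navigation"]
# Rows are models, columns are tasks; accuracies below are placeholders.
acc = np.array([
    [0.42, 0.55, 0.38, 0.30],
    [0.51, 0.60, 0.47, 0.36],
    [0.35, 0.48, 0.31, 0.25],
    [0.47, 0.52, 0.44, 0.33],
])

corr = np.corrcoef(acc.T)  # rows of acc.T are per-task accuracy vectors
for i in range(len(tasks)):
    for j in range(i + 1, len(tasks)):
        print(f"{tasks[i]:>16} vs {tasks[j]:<16} r = {corr[i, j]:+.2f}")
```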
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking Video-LLMs in urban 3D spaces
Drone-collected 3D video data for evaluation
Sim-to-Real transfer validation through fine-tuning (two-stage training sketch below)
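The Sim-to-Real validation amounts to a two-stage schedule: pretrain on simulation-generated QA pairs, then fine-tune on real-world drone footage. The sketch below shows only the schedule, using a self-contained toy probe and synthetic tensors in place of an actual Video-LLM and video features; all names here are illustrative.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class TinyProbe(nn.Module):
    """Stand-in for a Video-LLM head: maps a feature vector to 4 MCQ logits."""
    def __init__(self, dim=16, n_options=4):
        super().__init__()
        self.head = nn.Linear(dim, n_options)

    def forward(self, feats, labels):
        return nn.functional.cross_entropy(self.head(feats), labels)

def train_stage(model, loader, epochs, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for feats, labels in loader:
            loss = model(feats, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

def toy_loader(n):  # synthetic (features, answer-index) pairs
    return DataLoader(TensorDataset(torch.randn(n, 16),
                                    torch.randint(0, 4, (n,))), batch_size=8)

probe = TinyProbe()
train_stage(probe, toy_loader(64), epochs=2, lr=1e-3)  # stage 1: simulation pretraining
train_stage(probe, toy_loader(32), epochs=1, lr=5e-4)  # stage 2: real-world fine-tuning
```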
Baining Zhao
Tsinghua University
Jianjie Fang
Master's student, Tsinghua University
Embodied AI, LLMs
Zichao Dai
Tsinghua University
Ziyou Wang
Tsinghua University
Jirong Zha
Tsinghua University
Weichen Zhang
PhD, University of Sydney
Computer Vision, Deep Learning, Transfer Learning, Domain Adaptation
Chen Gao
Tsinghua University
Yue Wang
Tsinghua University
Jinqiang Cui
PCL
LLM/VLM + multi-robot systems
Xinlei Chen
Tsinghua University
Yong Li
Tsinghua University