Sekai: A Video Dataset towards World Exploration

📅 2025-06-18
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing video generation datasets suffer from narrow geographic coverage, short durations, static scenes, and insufficient semantic annotations—limiting their utility for world exploration modeling. To address this, we introduce Sekai, the first high-quality, first-person video dataset explicitly designed for interactive world exploration: it spans 750 cities across 100+ countries and comprises over 5,000 hours of walking and aerial footage. We develop an end-to-end pipeline for acquisition, preprocessing, and annotation—enabling automated geocoding, multimodal semantic labeling (e.g., location, weather, crowd density, descriptive captions, camera trajectory), and spatiotemporally aligned fine-grained scene understanding. Sekai overcomes four key bottlenecks—geographic breadth, temporal length, scene dynamics, and semantic richness—thereby enabling the training of YUME, a novel exploration model that significantly improves spatial consistency and exploration plausibility in generated outputs.

šŸ“ Abstract
Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well suited for world exploration training, as they suffer from several limitations: limited locations, short durations, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person-view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We use a subset to train an interactive video world exploration model named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the areas of video generation and world exploration, and motivate valuable applications.
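As a rough illustration of the annotation types the abstract lists (location, scene, weather, crowd density, captions, camera trajectories), a per-clip record might be organized as sketched below. The schema, field names, and values here are hypothetical, not the paper's actual format.

```python
# Hypothetical sketch of a per-clip annotation record for a Sekai-style
# dataset; field names and values are illustrative, not the paper's schema.
from dataclasses import dataclass, field


@dataclass
class ClipAnnotation:
    video_id: str
    country: str
    city: str
    scene: str           # e.g. "street", "park", "indoor market"
    weather: str         # e.g. "sunny", "rain", "snow"
    crowd_density: str   # e.g. "empty", "sparse", "crowded"
    caption: str
    # Per-frame camera pose, e.g. (x, y, z, yaw, pitch, roll)
    camera_trajectory: list[tuple[float, ...]] = field(default_factory=list)


clip = ClipAnnotation(
    video_id="tokyo_0001",
    country="Japan",
    city="Tokyo",
    scene="street",
    weather="sunny",
    crowd_density="crowded",
    caption="Walking through a busy shopping street at dusk.",
    camera_trajectory=[(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)],
)
print(clip.city)  # → Tokyo
```

Such a record keeps the semantic labels and the camera trajectory spatiotemporally aligned with the clip, which is the property the dataset's annotations are meant to provide.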
Problem

Research questions and friction points this paper is trying to address.

Existing video datasets lack diversity for world exploration training
Current datasets have limited locations, duration, and annotations
Need high-quality annotated videos for interactive world exploration models
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-quality first-person worldwide video dataset
Efficient toolbox for video collection and annotation
Interactive video world exploration model YUME
Zhen Li
Shanghai AI Laboratory, Beijing Institute of Technology, Shenzhen MSU-BIT University
Chuanhao Li
Shanghai AI Laboratory
Xiaofeng Mao
Alibaba Group
Computer Vision · Adversarial Machine Learning
Shaoheng Lin
Shanghai AI Laboratory
Ming Li
Shanghai AI Laboratory
Shitian Zhao
Shanghai AI Laboratory
LLM · MLLM · Generative Model
Zhaopan Xu
Shanghai AI Laboratory
Xinyue Li
Shanghai AI Laboratory
Yukang Feng
Shanghai Innovation Institute
Jianwen Sun
Software Engineering Application Technology Lab, Huawei, China
Software Engineering · Deep Reinforcement Learning
Zizhen Li
Shanghai Innovation Institute
Fanrui Zhang
Shanghai Innovation Institute
Jiaxin Ai
Shanghai Innovation Institute
Zhixiang Wang
University of Tokyo
Computational Photography · Computational Imaging · Machine Learning
Yuwei Wu
Ph.D. candidate, GRASP Lab, University of Pennsylvania
Robotics · Trajectory Optimization · Task and Motion Planning
Tong He
Shanghai AI Laboratory
Jiangmiao Pang
Shanghai AI Laboratory
Yu Qiao
Shanghai AI Laboratory
Yunde Jia
Shanghai AI Laboratory
Kaipeng Zhang
Shanghai AI Laboratory
LLM · Multimodal LLMs · AIGC