Vidi2: Large Multimodal Models for Video Understanding and Creation

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of fine-grained spatiotemporal understanding and generation in long-duration videos, targeting three core tasks: multimodal temporal retrieval (TR), spatiotemporal grounding (STG), and video question answering (Video QA). We propose an end-to-end architecture for fine-grained spatiotemporal localization, introduce VUE-STG, a high-quality long-context benchmark for STG, and upgrade VUE-TR to VUE-TR-V2. To enable rigorous evaluation, we devise a joint vIoU/tIoU/vIoU-Intersection metric scheme supporting both spatiotemporal grounding and cross-modal temporal reasoning. Our method leverages large multimodal models to achieve precise, text-driven spatiotemporal localization and multi-turn video QA. Experiments demonstrate substantial improvements over closed-source models, including Gemini 3 Pro and GPT-5, on VUE-TR-V2 and VUE-STG, while matching state-of-the-art open-source models of comparable scale on Video QA. The framework establishes a scalable, high-precision foundation for multimodal video reasoning, with direct applicability to complex video editing and other downstream tasks.

📝 Abstract
Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10 seconds to 30 minutes, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models of similar scale on Video QA benchmarks.
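
The abstract names the evaluation metrics (tIoU, vIoU, and a vIoU-Intersection variant) without defining them here. For orientation, the sketch below implements the standard tIoU and vIoU definitions from the spatio-temporal grounding literature; the `viou_intersection` function is one plausible reading (mean box IoU over intersecting frames only) and may differ from the refined scheme actually used in VUE-STG. All function names are illustrative.

```python
# Minimal sketch of the standard tIoU / vIoU metrics used to evaluate
# spatio-temporal grounding. The exact "refined vIoU/tIoU/vIoU-Intersection
# scheme" of VUE-STG is not specified in the abstract, so viou_intersection
# below is an assumption, not the paper's definition.

def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def tiou(pred, gt):
    """Temporal IoU of two (start, end) time ranges in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def viou(pred_boxes, gt_boxes):
    """vIoU: mean per-frame box IoU over the union of annotated frames.

    pred_boxes / gt_boxes map frame index -> (x1, y1, x2, y2).
    Frames present in only one of the two dicts contribute 0 to the sum.
    """
    union_frames = set(pred_boxes) | set(gt_boxes)
    inter_frames = set(pred_boxes) & set(gt_boxes)
    if not union_frames:
        return 0.0
    total = sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in inter_frames)
    return total / len(union_frames)

def viou_intersection(pred_boxes, gt_boxes):
    """Assumed vIoU-Intersection: mean box IoU over intersecting frames only."""
    inter_frames = set(pred_boxes) & set(gt_boxes)
    if not inter_frames:
        return 0.0
    total = sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in inter_frames)
    return total / len(inter_frames)
```

Intuitively, vIoU penalizes both temporal misses (frames predicted outside the ground-truth range drag the denominator up) and spatial misses (poor boxes on overlapping frames drag the numerator down), while the intersection-only variant isolates spatial accuracy within the correctly retrieved span.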
Problem

Research questions and friction points this paper is trying to address.

Advancing video understanding with fine-grained spatio-temporal grounding capabilities
Extending multimodal reasoning to video question answering tasks
Enabling end-to-end spatio-temporal localization for complex video editing scenarios (one plausible output shape is sketched below)
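
To make the localization output concrete, here is a minimal sketch of what a grounding result could look like, assuming the model returns, per text query, a set of time ranges each carrying per-frame bounding boxes. The dataclasses and field names are hypothetical, not Vidi2's actual interface.

```python
# Hypothetical shape of a spatio-temporal grounding result: a text query
# mapped to one or more grounded time ranges with per-frame boxes.
from dataclasses import dataclass, field

@dataclass
class GroundedSegment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    # frame index -> (x1, y1, x2, y2) box for the queried object;
    # coordinates may be pixel or normalized depending on convention
    boxes: dict[int, tuple[float, float, float, float]] = field(default_factory=dict)

@dataclass
class STGResult:
    query: str  # e.g. "the man in the red jacket"
    segments: list[GroundedSegment] = field(default_factory=list)

# Example: one grounded segment with a single annotated frame.
result = STGResult(
    query="the man in the red jacket",
    segments=[GroundedSegment(start=12.0, end=18.5,
                              boxes={360: (0.21, 0.30, 0.45, 0.88)})],
)
```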
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vidi2 enables fine-grained spatio-temporal video grounding
It integrates video question answering for multimodal reasoning
The model introduces the VUE-STG and upgraded VUE-TR-V2 benchmarks for comprehensive evaluation
Authors

Celong Liu
Intelligent Creation, ByteDance Inc.
Chia-Wen Kuo
ByteDance US
Multimodal, Vision and Language
Chuang Huang
Intelligent Creation, ByteDance Inc.
Dawei Du
Research Scientist at ByteDance, Inc.
computer vision, machine learning, deep learning
Fan Chen
Intelligent Creation, ByteDance Inc.
Guang Chen
Intelligent Creation, ByteDance Inc.
Haoji Zhang
Intelligent Creation, ByteDance Inc.
Haojun Zhao
Intelligent Creation, ByteDance Inc.
Lingxi Zhang
Intelligent Creation, ByteDance Inc.
Lu Guo
ByteDance/TikTok
Information Science, AI, NLP, computational social science, LLMs
Lusha Li
Intelligent Creation, ByteDance Inc.
Longyin Wen
ByteDance Inc.
Artificial Intelligence, Computer Vision, Machine Learning
Qihang Fan
PhD Student, Institute of Automation, Chinese Academy of Sciences
computer vision, multi-modal large language model, deep learning architecture
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining, Machine learning, Data curation, BioNLP, Medical Imaging Analysis
Rachel Deng
Intelligent Creation, ByteDance Inc.
Sijie Zhu
Unknown affiliation
Stuart Siew
Intelligent Creation, ByteDance Inc.
Tong Jin
Intelligent Creation, ByteDance Inc.
Weiyan Tao
Intelligent Creation, ByteDance Inc.
Wen Zhong
Intelligent Creation, ByteDance Inc.
Xiaohui Shen
ByteDance Research
Computer Vision
Xin Gu
Intelligent Creation, ByteDance Inc.
Zhenfang Chen
MIT-IBM Watson AI Lab
Vision-Language Models, Multimodal AI, Neuro-Symbolic AI
Zuhua Lin
Intelligent Creation, ByteDance Inc.