🤖 AI Summary
This work addresses the challenge of fine-grained spatiotemporal understanding and generation in long-duration videos, targeting three core tasks: multimodal temporal retrieval (TR), spatiotemporal grounding (STG), and video question answering (Video QA). We propose an end-to-end fine-grained spatiotemporal localization architecture, introduce VUE-STG—a high-quality long-context benchmark for STG—and upgrade the earlier VUE-TR benchmark to VUE-TR-V2. To enable rigorous evaluation, we devise a joint vIoU/tIoU/vIoU-Intersection metric supporting both spatiotemporal grounding and cross-modal temporal reasoning. Our method leverages large multimodal models to achieve precise text-driven spatiotemporal localization and multi-turn video QA. Experiments demonstrate substantial improvements over closed-source models—including Gemini 3 Pro and GPT-5—on VUE-TR-V2 and VUE-STG, while matching state-of-the-art open-source models of comparable scale on Video QA. The framework establishes a scalable, high-precision foundation for multimodal video reasoning, with direct applicability to complex video editing and other downstream tasks.
📝 Abstract
Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: durations span roughly 10 seconds to 30 minutes, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving results competitive with popular open-source models of similar scale on Video QA benchmarks.
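To make the tIoU/vIoU terminology concrete, the sketch below shows the standard definitions from the spatio-temporal video grounding literature: tIoU is the intersection-over-union of predicted and ground-truth time ranges, and vIoU averages per-frame box IoU over the union of annotated frames (counting IoU only on frames where both tracks exist). This is an illustrative assumption; the refined vIoU/tIoU/vIoU-Intersection scheme used by VUE-STG is defined in the paper and may differ in detail.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) time ranges, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def viou(pred_boxes, gt_boxes):
    """Spatio-temporal vIoU: average box IoU over the union of frames
    covered by either track, summed only where both tracks have a box.
    pred_boxes / gt_boxes: dict mapping frame_index -> box tuple."""
    union_frames = set(pred_boxes) | set(gt_boxes)
    if not union_frames:
        return 0.0
    inter_frames = set(pred_boxes) & set(gt_boxes)
    total = sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in inter_frames)
    return total / len(union_frames)
```

For example, a prediction covering frames {0, 1} against ground truth covering {1, 2} is scored over three union frames, so even a perfect box on the single shared frame yields vIoU = 1/3, which is why temporally mislocalized predictions are penalized.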