🤖 AI Summary
To address the challenges of joint multimodal (visual, audio, textual) understanding and fine-grained temporal localization in intelligent editing of long videos (up to one hour), this paper introduces the first end-to-end foundation model for long-video temporal retrieval. Methodologically, we design a multimodal joint encoder supporting variable-length inputs, integrating cross-modal alignment, long-range temporal modeling, and hierarchical attention to enable precise, natural language-driven temporal localization. Our contributions are threefold: (1) we establish VUE-TR, a high-quality benchmark featuring human-annotated temporal ground truth and a multi-interval IoU evaluation metric; (2) on VUE-TR, our model significantly outperforms closed-source models including GPT-4o and Gemini, achieving substantial gains in temporal retrieval accuracy; and (3) empirical results demonstrate strong practical utility and state-of-the-art performance in real-world video editing scenarios.
Abstract
Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality, large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos corresponding to a given text query, which plays a critical role in intelligent editing. The model is capable of processing hour-long videos with strong temporal understanding capability, e.g., retrieving the time ranges that match given queries. To support a comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements: 1) Video duration: videos are significantly longer than in existing temporal retrieval datasets; 2) Audio support: audio-based queries are included; 3) Query format: queries vary in length and format; 4) Annotation quality: ground-truth time ranges are manually annotated; and 5) Evaluation metric: a refined IoU metric supports evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.
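The exact formulation of the refined multi-interval IoU metric is not spelled out in the abstract. Below is a minimal sketch of one natural reading, assuming the standard set-based definition: predictions and ground truth are each treated as a union of time ranges, and the score is the intersection length divided by the union length, in seconds. The function names (`multi_interval_iou`, `merge_intervals`) are illustrative, not from the paper.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs into a sorted, disjoint union."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals):
    return sum(end - start for start, end in intervals)


def intersection_length(a, b):
    """Total overlap between two sorted, disjoint interval lists."""
    i = j = 0
    overlap = 0.0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        overlap += max(0.0, hi - lo)
        if a[i][1] < b[j][1]:  # advance the interval that ends first
            i += 1
        else:
            j += 1
    return overlap


def multi_interval_iou(pred, gt):
    """IoU between two sets of time ranges, each a list of (start, end) in seconds."""
    pred, gt = merge_intervals(pred), merge_intervals(gt)
    inter = intersection_length(pred, gt)
    union = total_length(pred) + total_length(gt) - inter
    return inter / union if union > 0 else 0.0


# Example: two predicted ranges vs. two annotated ranges (in seconds).
print(multi_interval_iou([(10, 30), (100, 120)], [(15, 35), (100, 110)]))  # ~0.556
```

Reducing each side to a disjoint union before scoring makes the metric insensitive to how a prediction is split into fragments, which matters when a single query matches several separated moments in an hour-long video.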