IVCR-200K: A Large-Scale Multi-turn Dialogue Benchmark for Interactive Video Corpus Retrieval

📅 2025-12-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing video retrieval and video moment retrieval systems work in a static, one-way mode and, according to the paper, can no longer fully meet the personalized and dynamic needs of at least 80.8% of users. Method: The paper introduces Interactive Video Corpus Retrieval (IVCR), a task that brings multi-turn, conversational interaction into video retrieval. It constructs IVCR-200K, a high-quality, bilingual, multi-turn, conversational dataset with abstract semantics that supports both video retrieval and moment retrieval, and it proposes a framework based on multimodal large language models (MLLMs) that lets users interact in several modes and returns more explainable results. Contribution/Results: Experiments demonstrate the effectiveness of the dataset and the framework.

📝 Abstract

In recent years, significant developments have been made in both video retrieval and video moment retrieval tasks, which respectively retrieve complete videos or moments for a given text query. These advancements have greatly improved user satisfaction during the search process. However, previous work has failed to establish meaningful "interaction" between the retrieval system and the user, and its one-way retrieval paradigm can no longer fully meet the personalization and dynamic needs of at least 80.8% of users. In this paper, we introduce the Interactive Video Corpus Retrieval (IVCR) task, a more realistic setting that enables multi-turn, conversational, and realistic interactions between the user and the retrieval system. To facilitate research on this challenging task, we introduce IVCR-200K, a high-quality, bilingual, multi-turn, conversational, and abstract semantic dataset that supports video retrieval and even moment retrieval. Furthermore, we propose a comprehensive framework based on multi-modal large language models (MLLMs) to help users interact in several modes with more explainable solutions. The extensive experiments demonstrate the effectiveness of our dataset and framework.
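To make the task setting concrete, the sketch below shows one way a multi-turn session mixing whole-video retrieval and moment retrieval could be represented. The class names, field names, and example values are hypothetical and are not the IVCR-200K schema. The temporal IoU helper is a measure commonly used for moment retrieval; the abstract does not say which metrics the paper reports.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Turn:
    """One user request plus the system's answer (hypothetical schema)."""
    query: str
    mode: str  # "video" = retrieve a whole video, "moment" = localize a span
    video_id: Optional[str] = None
    span: Optional[Tuple[float, float]] = None  # (start_s, end_s) in seconds


@dataclass
class Session:
    """A multi-turn dialogue between a user and the retrieval system."""
    turns: List[Turn] = field(default_factory=list)

    def current_video(self) -> Optional[str]:
        """Return the most recently retrieved video, so that a follow-up
        such as 'where do they knead the dough?' has something to refer to."""
        for turn in reversed(self.turns):
            if turn.video_id is not None:
                return turn.video_id
        return None


def temporal_iou(pred: Tuple[float, float], gold: Tuple[float, float]) -> float:
    """Intersection-over-union of two time spans given as (start, end)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


# Illustrative two-turn session: first a video, then a moment inside it.
session = Session()
session.turns.append(
    Turn("find a video of someone baking bread", "video", video_id="vid_001")
)
session.turns.append(
    Turn("where do they knead the dough?", "moment", span=(10.0, 20.0))
)
```

Keeping earlier turns in the session is what lets a later moment query resolve against the video retrieved in a previous turn, which is the interaction the one-way paradigm lacks.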

Problem

Research questions and friction points this paper is trying to address.

One-way retrieval cannot fully meet users' personalized and dynamic needs

Lack of meaningful user-system interaction in video search

Need for conversational, multi-turn video and moment retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn conversational interactive video retrieval system

Bilingual multi-turn abstract semantic dataset IVCR-200K

Multi-modal large language model framework for explainable solutions

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs