Interactive Multi-Turn Retrieval for Health Videos

📅 2026-05-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a limitation of existing health video retrieval systems: they are single-turn, so they cope poorly with clinically relevant scenarios in which a user’s initial query is ambiguous and must be refined over several rounds, for example by specifying posture or contraindications. To bridge this gap, the authors construct MHVRC, the first multi-turn interactive retrieval corpus tailored for health videos, and propose DATR, a dialogue-aware two-stage retrieval framework. In the first stage, DATR uses a CLIP-style dual encoder with sparse frame sampling for efficient coarse retrieval; in the second stage, it fuses the multi-turn dialogue context and re-ranks the top candidates with a lightweight cross-encoder. Experiments on MHVRC show consistent gains over strong text-video retrieval baselines, and user studies indicate that refined multi-turn queries capture fine-grained procedural semantics better than single-turn annotations.

📝 Abstract

The growing availability of health-related instructional videos creates new opportunities for clinical training, patient rehabilitation, and health education, yet existing retrieval systems remain largely single-turn: a user submits one query and receives one ranked list. This interaction is brittle in health scenarios, where information needs are often vague at first and become clinically meaningful only after follow-up constraints such as posture, hand placement, contraindications, equipment, or patient condition are specified. We introduce interactive multi-turn semantic retrieval for health videos and construct MHVRC, a Multi-Turn Health Video Retrieval Corpus, by combining video-grounded descriptions from VideoChat-Flash with query refinements generated by DeepSeek. We further propose DATR, a Dialogue-Aware Two-Stage Retrieval framework. DATR first performs efficient coarse retrieval with a CLIP-style dual encoder and sparse frame sampling, then re-ranks the top candidates through multi-turn query fusion and a lightweight cross-encoder scoring module. Experiments on MHVRC show consistent gains over strong text-video retrieval baselines, while user studies indicate that refined multi-turn queries better capture fine-grained procedural semantics than single-turn annotations. The work establishes a benchmark and a scalable technical recipe for interactive health video retrieval.
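The two-stage flow described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names, the recency-weighted fusion rule, the `decay` parameter, the choice to run the coarse stage on the latest turn, and the use of cosine similarity as a stand-in for the cross-encoder are all assumptions made for illustration, and the toy 2-D vectors stand in for real encoder outputs.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Sparse uniform sampling: take the centre frame of each equal-length segment."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    k = min(num_frames, num_samples)
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def fuse_turns(turn_embs: List[Vector], decay: float = 0.5) -> Vector:
    """Hypothetical fusion rule: recency-weighted mean of per-turn embeddings.

    The latest turn gets weight 1, the one before it `decay`, then decay**2, ...
    """
    n = len(turn_embs)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    dim = len(turn_embs[0])
    return [sum(w * e[d] for w, e in zip(weights, turn_embs)) / total
            for d in range(dim)]


def coarse_retrieve(query: Vector, videos: Dict[str, Vector], top_k: int) -> List[str]:
    """Stage 1: rank all videos by dual-encoder similarity and keep the top_k."""
    ranked = sorted(videos, key=lambda vid: cosine(query, videos[vid]), reverse=True)
    return ranked[:top_k]


def rerank(candidates: List[str], score: Callable[[str], float]) -> List[str]:
    """Stage 2: re-order only the shortlisted candidates with a costlier scorer."""
    return sorted(candidates, key=score, reverse=True)


def demo() -> List[str]:
    # Toy video embeddings (assumed to be pooled over sparsely sampled frames).
    videos = {"squat": [1.0, 0.0], "lunge": [0.0, 1.0], "plank": [0.7, 0.7]}
    # Two dialogue turns: an initial query and one refinement.
    turns = [[1.0, 0.0], [0.0, 1.0]]

    # Assumption: the coarse stage scores against the most recent turn only.
    shortlist = coarse_retrieve(turns[-1], videos, top_k=2)

    # Assumption: cosine against the fused query stands in for a cross-encoder.
    fused = fuse_turns(turns, decay=0.5)
    return rerank(shortlist, lambda vid: cosine(fused, videos[vid]))
```

In this toy setup the coarse stage shortlists `lunge` and `plank`, and the fused query then moves `plank` ahead of `lunge` because it matches both turns. The split keeps the expensive scorer off the full collection: only `top_k` candidates reach the second stage.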

Problem

Research questions and friction points this paper is trying to address.

interactive retrieval

multi-turn retrieval

health videos

information need refinement

clinical semantics

Innovation

Methods, ideas, or system contributions that make the work stand out.

interactive retrieval

multi-turn dialogue

health video retrieval