Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval

📅 2025-10-14
🤖 AI Summary
In practice, video-text retrieval must handle untrimmed long videos that are only partially relevant to a text query. This paper targets that setting, the Partially Relevant Video Retrieval (PRVR) task, and proposes DL-DKD++, a dual-learning framework that models fine-grained cross-modal temporal alignment. A two-branch student network, comprising an inheritance branch and an exploration branch, is trained with a dynamic knowledge distillation mechanism that adaptively generates soft targets, transferring knowledge from a large-scale vision-language pretrained teacher model to a lightweight, task-specific student model. The framework combines soft alignment supervision with contrastive learning and is optimized end to end. Experiments on TVR, ActivityNet Captions, and Charades-STA show that DL-DKD++ achieves state-of-the-art PRVR performance on untrimmed, partially relevant video inputs.

📝 Abstract
Almost all previous text-to-video retrieval work assumes the ideal case in which videos are pre-trimmed to short durations and contain only text-related content. In practice, however, videos are typically untrimmed and long, with far more complicated background content. In this paper, we therefore focus on the more practical yet challenging task of Partially Relevant Video Retrieval (PRVR), which aims to retrieve untrimmed videos that are partially relevant to a given query. To tackle this task, we propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model and transfers it to a lightweight, task-specific PRVR network. Specifically, we introduce a Dual Learning framework with Dynamic Knowledge Distillation (DL-DKD++), in which a large teacher model provides supervision to a compact dual-branch student network. The student model comprises two branches: an inheritance branch that absorbs transferable knowledge from the teacher, and an exploration branch that learns task-specific information from the PRVR dataset to address domain gaps. To further enhance learning, we incorporate a dynamic soft-target construction mechanism. By replacing rigid hard-target supervision with adaptive soft targets that evolve during training, our method enables the model to better capture the fine-grained, partial relevance between videos and queries. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on the TVR, ActivityNet, and Charades-STA datasets for PRVR. The code is available at https://github.com/HuiGuanLab/DL-DKD.
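
As a rough illustration of the dual-branch idea described in the abstract, the sketch below computes a training loss for one query over a small batch of videos. It is not the authors' implementation: the use of KL divergence for distillation, the InfoNCE-style contrastive loss, the linear weight schedule, and the names `distill_weight` and `dual_branch_loss` are all assumptions made for illustration. The released code in the repository linked in the abstract is the authoritative reference.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def contrastive_loss(scores, positive_index):
    """InfoNCE-style loss: negative log-probability of the matching video."""
    return -math.log(softmax(scores)[positive_index])

def distill_weight(epoch, total_epochs, start=1.0, end=0.1):
    """Hypothetical linear schedule for the distillation weight (an assumption)."""
    progress = epoch / max(total_epochs - 1, 1)
    return start + (end - start) * progress

def dual_branch_loss(teacher, inherit, explore, positive_index, epoch, total_epochs):
    """Inheritance branch mimics the teacher; exploration branch fits the labels."""
    distill = kl_divergence(softmax(teacher), softmax(inherit))
    task = contrastive_loss(explore, positive_index)
    return distill_weight(epoch, total_epochs) * distill + task

# Toy query-to-video similarity scores; video 0 is the labelled match.
teacher_scores = [2.0, 0.5, 0.1]
inherit_scores = [1.5, 0.7, 0.2]
explore_scores = [1.0, 0.2, 0.3]

early = dual_branch_loss(teacher_scores, inherit_scores, explore_scores, 0, 0, 10)
late = dual_branch_loss(teacher_scores, inherit_scores, explore_scores, 0, 9, 10)
print(early, late)
```

With identical branch scores, the loss is larger early in training than late, because the assumed schedule shrinks the weight on the distillation term while the task term stays fixed.
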

Problem

Research questions and friction points this paper is trying to address.

Retrieving untrimmed videos with partial relevance to text queries
Addressing domain gaps between pre-trained models and task-specific data
Capturing fine-grained partial relevance in long-duration complex videos

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual learning framework with dynamic knowledge distillation
Compact dual-branch student network with inheritance and exploration
Dynamic soft-target construction replacing rigid hard-target supervision
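
One simple way to picture replacing hard targets with soft targets that evolve during training is to blend the one-hot label with the teacher's similarity distribution, using a mixing coefficient that changes by epoch. The sketch below does only that. The blending rule, the direction and shape of the `alpha_at` schedule, and the names `soft_target` and `alpha_at` are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of similarity scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def alpha_at(epoch, total_epochs, max_alpha=0.5):
    """Hypothetical schedule: lean more on the teacher as training proceeds."""
    return max_alpha * epoch / max(total_epochs - 1, 1)

def soft_target(teacher_scores, positive_index, alpha):
    """Blend a one-hot hard target with the teacher's distribution.

    alpha = 0 gives the hard target; alpha = 1 gives the teacher distribution.
    """
    teacher_probs = softmax(teacher_scores)
    return [
        (1.0 - alpha) * (1.0 if i == positive_index else 0.0) + alpha * p
        for i, p in enumerate(teacher_probs)
    ]

def cross_entropy(target, student_scores):
    """Cross-entropy between a target distribution and the student's softmax."""
    probs = softmax(student_scores)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

# Toy scores: the teacher also rates video 1 as fairly relevant to the query.
teacher_scores = [2.0, 1.8, 0.1]
student_scores = [1.2, 0.9, 0.3]
for epoch in (0, 5, 9):
    target = soft_target(teacher_scores, 0, alpha_at(epoch, 10))
    print(epoch, [round(t, 3) for t in target], cross_entropy(target, student_scores))
```

At epoch 0 the target is the plain one-hot label. In later epochs some probability mass moves to video 1, so the student is no longer penalized as heavily for scoring a partially relevant video highly.
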

Authors

Jianfeng Dong
College of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou 310035, China
Lei Huang
College of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou 310035, China
Daizong Liu
Wuhan University
Computer Vision, Vision and Language, 3D Understanding, Adversarial Robustness, LVLM
Xianke Chen
College of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou 310035, China
Xun Yang
School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China
Changting Lin
Zhejiang University
Computer Science
Xun Wang
College of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou 310035, China
Meng Wang
School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China