🤖 AI Summary
When adapting CLIP from image-text pretraining to video-text retrieval, three critical mismatches arise: (1) visual modality mismatch (frames → videos), (2) linguistic modality mismatch (short image captions → lengthy video descriptions), and (3) cross-modal alignment mismatch. To address these jointly, this paper proposes the first parameter-efficient transfer framework that simultaneously models and mitigates all three mismatches. Its core innovations are: (1) an image-video feature fusion module that explicitly captures temporal dynamics; (2) pseudo-image caption generation coupled with image-level alignment knowledge distillation, transferring fine-grained image-text alignment capability to video-text retrieval; and (3) lightweight adapter-based fine-tuning. Evaluated on MSRVTT using CLIP-ViT-B/16, our method achieves R@1 = 50.5%, surpassing prior state-of-the-art by 1.5 percentage points—demonstrating the effectiveness of coordinated mitigation of the three mismatches.
📝 Abstract
The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP focuses on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image level to video level: vision, language, and alignment. However, existing methods mainly address vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5%. The code is available at https://github.com/LunarShen/DsicoVLA.
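To make the Image-to-Video Alignment Distillation idea more concrete, here is a minimal NumPy sketch of one plausible formulation: a teacher similarity matrix from image-level (frame/pseudo-caption) matching is distilled into a student video-level similarity matrix via a KL divergence over softmax-normalized retrieval distributions. All function names, the temperature value, and the exact loss form are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(p, q, eps=1e-8):
    # Row-wise KL(p || q) between two probability distributions.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def alignment_distillation_loss(img_sim, vid_sim, tau=0.07):
    """Hypothetical image-to-video alignment distillation loss.

    img_sim: (B, B) image-level similarity matrix (teacher), e.g. from
             frame features matched against pseudo image captions.
    vid_sim: (B, B) video-level similarity matrix (student), from fused
             video features matched against the full text descriptions.
    tau:     softmax temperature (assumed value, not from the paper).
    """
    teacher = softmax(img_sim / tau)  # target retrieval distribution
    student = softmax(vid_sim / tau)  # distribution to be aligned
    return float(kl_div(teacher, student).mean())

# Toy example with a batch of 4 video-text pairs.
rng = np.random.default_rng(0)
B = 4
img_sim = rng.normal(size=(B, B))
vid_sim = rng.normal(size=(B, B))
print(alignment_distillation_loss(img_sim, vid_sim))  # positive loss
print(alignment_distillation_loss(img_sim, img_sim))  # zero when matched
```

In training, this term would be added to the standard contrastive retrieval loss, so the lightweight adapters learn video-level alignment while inheriting CLIP's fine-grained image-level alignment.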