🤖 AI Summary
Existing text-to-video retrieval methods face two key challenges: heavy reliance on large-scale annotated data, which incurs prohibitive annotation costs, and a substantial modality gap between video and text, compounded by insufficient exploitation of dynamic information, which hinders precise cross-modal alignment. To address these issues, the authors propose FDA-CLIP (Frame Difference Alpha-CLIP), a frame-difference-guided framework that explicitly generates dynamic region masks from inter-frame differences and feeds them to the model as an additional alpha channel, steering attention toward motion-critical regions while suppressing static background interference. The framework builds on a lightweight Alpha-CLIP architecture that fuses the dynamic masks end-to-end with frozen CLIP visual and textual features. Extensive experiments on multiple benchmarks show significant gains in retrieval accuracy (e.g., +3.2% R@1 on MSR-VTT) while keeping inference efficient, without introducing additional parameters or requiring extra pretraining.
📝 Abstract
With the rapid growth of video data, text-video retrieval has become increasingly important in applications such as recommendation and search. Early text-video retrieval methods suffer from two critical drawbacks: first, they rely heavily on large-scale annotated video-text pairs, leading to high data acquisition costs; second, a significant modality gap between video and text features limits cross-modal alignment accuracy. With the development of vision-language models, adapting CLIP to video tasks has attracted great attention. However, existing adaptation methods generally lack enhancement of dynamic video features and fail to effectively suppress static redundant features. To address these issues, this paper proposes FDA-CLIP (Frame Difference Alpha-CLIP), a concise CLIP-based training framework for text-video alignment. Specifically, the method uses frame differences to generate dynamic region masks, which are fed into Alpha-CLIP as an additional alpha channel, proactively guiding the model to focus on semantically critical dynamic regions while suppressing static background redundancy. Experiments demonstrate that frame difference-guided video semantic encoding effectively balances retrieval efficiency and accuracy.
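The mask-generation step described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the use of a simple absolute-difference threshold, and the way the mask is attached as a fourth channel are all illustrative assumptions about how inter-frame differences could yield an alpha-channel mask for an Alpha-CLIP-style encoder.

```python
import numpy as np

def frame_difference_masks(frames, threshold=0.1):
    """Illustrative dynamic-region masks from inter-frame differences.

    frames: float array of shape (T, H, W, C), values in [0, 1].
    Returns masks of shape (T, H, W) with 1.0 marking dynamic regions.
    """
    # Absolute difference between consecutive frames, averaged over channels.
    diff = np.abs(np.diff(frames, axis=0)).mean(axis=-1)  # (T-1, H, W)
    # Reuse the first difference for frame 0 so every frame gets a mask.
    diff = np.concatenate([diff[:1], diff], axis=0)       # (T, H, W)
    # Binarize: pixels whose change exceeds the threshold count as dynamic.
    return (diff > threshold).astype(np.float32)

def attach_alpha(frames, masks):
    """Append the mask as a fourth (alpha) channel: (T, H, W, C+1)."""
    return np.concatenate([frames, masks[..., None]], axis=-1)

# Toy video: 3 frames of a 4x4 RGB clip where only pixel (1, 1) changes.
frames = np.zeros((3, 4, 4, 3), dtype=np.float32)
frames[1, 1, 1] = 1.0
masks = frame_difference_masks(frames)
video_rgba = attach_alpha(frames, masks)  # (3, 4, 4, 4), ready for a 4-channel encoder
```

In practice the binary mask could be softened (e.g., blurred or kept continuous) before being consumed as an alpha channel; the paper does not specify the exact post-processing, so the hard threshold here is only a placeholder.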