T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text-to-Video Retrieval

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the partial modality alignment problem in video-text retrieval—arising when textual descriptions cover only subsets of video content—this paper proposes T2VParser. Building upon vision-language pretrained models (e.g., CLIP), T2VParser introduces cross-modal shared learnable decomposition tokens that enable adaptive semantic disentanglement and fine-grained matching between text and video representations, thereby avoiding erroneous supervision from holistic representation alignment. The model optimizes a partial alignment objective via contrastive learning, preserving pretrained knowledge while mitigating information asymmetry between modalities. Experiments demonstrate significant improvements in Recall@K across multiple standard video retrieval benchmarks, particularly enhancing precise matching of salient video content. The source code is publicly available.

📝 Abstract
Text-to-video retrieval essentially aims to train models to align visual content with textual descriptions accurately. Due to the impressive general multimodal knowledge demonstrated by image-text pretrained models such as CLIP, existing work has primarily focused on extending CLIP knowledge for video-text tasks. However, videos typically contain richer information than images. In current video-text datasets, textual descriptions can only reflect a portion of the video content, leading to partial misalignment in video-text matching. Therefore, directly aligning text representations with video representations can result in incorrect supervision, ignoring the inequivalence of information. In this work, we propose T2VParser to extract multiview semantic representations from text and video, achieving adaptive semantic alignment rather than aligning the entire representation. To extract corresponding representations from different modalities, we introduce Adaptive Decomposition Tokens, which consist of a set of learnable tokens shared across modalities. The goal of T2VParser is to emphasize precise alignment between text and video while retaining the knowledge of pretrained models. Experimental results demonstrate that T2VParser achieves accurate partial alignment through effective cross-modal content decomposition. The code is available at https://github.com/Lilidamowang/T2VParser.
Problem

Research questions and friction points this paper is trying to address.

Aligning video content with partial text descriptions
Reducing misalignment in video-text matching tasks
Adaptive semantic alignment for cross-modal retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Decomposition Tokens for cross-modal alignment
Multiview semantic representations from text and video
Partial alignment via effective content decomposition
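The core idea above can be sketched in code: a small set of learnable tokens is shared across modalities, each token cross-attends over a modality's features (word embeddings for text, frame embeddings for video) to extract one semantic view, and retrieval then compares the two modalities view-by-view instead of as single holistic vectors. This is a minimal NumPy illustration, not the paper's implementation; the function names, the scaled-dot-product attention form, and the mean aggregation over tokens are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decompose(features, tokens):
    """Cross-attend shared decomposition tokens over one modality's features.

    features: (n, d) word or frame embeddings
    tokens:   (k, d) learnable tokens shared by both modalities
    Returns k token-conditioned views of the input, shape (k, d).
    """
    attn = softmax(tokens @ features.T / np.sqrt(features.shape[1]))  # (k, n)
    return attn @ features  # each row is one semantic "view"

def partial_alignment_score(text_views, video_views):
    """Per-token cosine similarity, averaged over the k shared tokens.

    Comparing views token-by-token (rather than one global vector)
    is the partial-alignment intuition; mean aggregation here is an
    illustrative choice, not the paper's exact objective.
    """
    t = text_views / np.linalg.norm(text_views, axis=1, keepdims=True)
    v = video_views / np.linalg.norm(video_views, axis=1, keepdims=True)
    return float((t * v).sum(axis=1).mean())

rng = np.random.default_rng(0)
d, k = 16, 4
tokens = rng.normal(size=(k, d))        # shared across both modalities
text_feats = rng.normal(size=(6, d))    # stand-in for word embeddings
video_feats = rng.normal(size=(12, d))  # stand-in for frame embeddings

score = partial_alignment_score(decompose(text_feats, tokens),
                                decompose(video_feats, tokens))
print(score)
```

In a trained model the token matrix would be optimized jointly with a contrastive loss over matched and mismatched text-video pairs, so each token learns to pick out one reusable aspect of content (e.g., action, scene, objects) in both modalities.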
Yili Li
Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Gang Xiong
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Gaopeng Gou
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Xiangyan Qu
Institute of Information Engineering, Chinese Academy of Sciences
Jiamin Zhuang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Zhen Li
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Junzheng Shi
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China