ReTrack: Evidence-Driven Dual-Stream Directional Anchor Calibration Network for Composed Video Retrieval

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses composed-feature bias, entangled modality contributions, and retrieval uncertainty in Composed Video Retrieval (CVR), all of which are caused by the gap in information density between the video and text modalities. To tackle these challenges, the authors propose ReTrack, a framework comprising three core components: Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment. ReTrack introduces a directional anchor calibration mechanism, described by the authors as the first of its kind for CVR, that corrects the orientation of composed multimodal query features and explicitly separates modality-specific contributions. It also uses bidirectional evidential reasoning to make similarity estimation more reliable. Built on a dual-stream architecture with semantic weight estimation, ReTrack achieves state-of-the-art performance on three benchmark datasets spanning both composed video retrieval and composed image retrieval, indicating strong generalization.

📝 Abstract

With the rapid growth of video data, Composed Video Retrieval (CVR) has emerged as a novel paradigm in video retrieval and is receiving increasing attention from researchers. Unlike unimodal video retrieval methods, the CVR task takes as input a multi-modal query consisting of a reference video and a piece of modification text. The modification text conveys the user's intended alterations to the reference video. Based on this input, the model aims to retrieve the most relevant target video. In the CVR task, there is a substantial discrepancy in information density between the video and text modalities. Traditional composition methods tend to bias the composed feature toward the reference video, which leads to suboptimal retrieval performance. This limitation is significant because of three core challenges: (1) modal contribution entanglement, (2) explicit optimization of composed features, and (3) retrieval uncertainty. To address these challenges, we propose the evidence-dRiven dual-sTream diRectionAl anChor calibration networK (ReTrack). ReTrack is the first CVR framework that improves multi-modal query understanding by calibrating directional bias in composed features. It consists of three key modules: Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment. Specifically, ReTrack estimates the semantic contribution of each modality to calibrate the directional bias of the composed feature. It then uses the calibrated directional anchors to compute bidirectional evidence that drives reliable composed-to-target similarity estimation. Moreover, ReTrack generalizes well to the Composed Image Retrieval (CIR) task, achieving state-of-the-art (SOTA) performance across three benchmark datasets in both CVR and CIR scenarios. Code is available at https://github.com/Lee-zixu/ReTrack
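
The bias toward the reference video described in the abstract can be illustrated with a toy example. The sketch below is not the ReTrack method and is not taken from the authors' repository. It assumes a simple convex combination of a video feature and a text feature, with a hand-set `text_weight` standing in for an estimated semantic contribution, and ranks targets by cosine similarity. All function names, feature vectors, and weights are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def compose_query(video_feat, text_feat, text_weight):
    """Hypothetical fusion: convex combination of the normalized features.

    text_weight is an assumed stand-in for a per-modality contribution;
    a small value pulls the composed feature toward the reference video.
    """
    v, t = l2_normalize(video_feat), l2_normalize(text_feat)
    mixed = [(1.0 - text_weight) * x + text_weight * y for x, y in zip(v, t)]
    return l2_normalize(mixed)

def rank_targets(query, targets):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine(query, feat)) for name, feat in targets.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Toy 3-D features, invented purely for illustration.
reference_video = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]
targets = {
    "near_reference": [1.0, 0.1, 0.0],         # ignores the requested change
    "modified_as_requested": [0.6, 0.8, 0.0],  # reflects the requested change
}

# A video-dominated query retrieves the clip that looks like the reference.
biased = rank_targets(compose_query(reference_video, modification_text, 0.1), targets)
# Giving the text more weight moves the query direction toward the intended target.
balanced = rank_targets(compose_query(reference_video, modification_text, 0.5), targets)

print(biased[0][0])    # near_reference
print(balanced[0][0])  # modified_as_requested
```

In this toy setting the only thing that changes between the two rankings is the direction of the composed feature, which is the quantity the abstract says ReTrack calibrates. How ReTrack actually estimates contributions and computes evidence is not specified here.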

Problem

Research questions and friction points this paper is trying to address.

Composed Video Retrieval

multi-modal query

information density discrepancy

modal contribution entanglement

retrieval uncertainty

Innovation

Methods, ideas, or system contributions that make the work stand out.

Composed Video Retrieval

Directional Anchor Calibration

Evidence-Driven Alignment