HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval

📅 2025-10-27
🏛️ Proceedings of the 33rd ACM International Conference on Multimedia
📈 Citations: 3 · Influential: 0
🤖 AI Summary
To address referential ambiguity and insufficient fine-grained semantic attention in Composed Video Retrieval (CVR), problems caused by the disparity in information density between reference videos and modification texts, this paper proposes the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework to explicitly exploit inter-modal information density differences, jointly performing holistic pronoun disambiguation, atomistic-level uncertainty modeling, and progressive holistic-to-atomistic alignment, thereby integrating cross-modal interaction with fine-grained semantic alignment. Evaluated on multiple CVR and Composed Image Retrieval (CIR) benchmarks, HUD achieves state-of-the-art performance with strong generalization. Its core contributions are: (1) uncovering and formally modeling how modality-specific information density differences impede multi-modal understanding; and (2) introducing a hierarchical, uncertainty-driven paradigm for disambiguation and alignment that bridges coarse- and fine-grained semantic representations across modalities.

📝 Abstract
Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning. Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The code is available at https://zivchen-ty.github.io/HUD.github.io/.
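The three components described in the abstract can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the function names, the cross-attention-style grounding, and the variance proxy used for uncertainty weighting are all illustrative assumptions about how holistic disambiguation, atomistic uncertainty modeling, and uncertainty-weighted composition could fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def holistic_disambiguation(video_tokens, text_tokens):
    """Ground each text token in the denser video modality via a
    cross-attention-style weighted sum, so vague referents (e.g.
    pronouns in the modification text) pick up video semantics."""
    attn = softmax(text_tokens @ video_tokens.T / np.sqrt(video_tokens.shape[1]))
    return text_tokens + attn @ video_tokens

def atomistic_uncertainty(tokens):
    """Treat each grounded token as a (mean, variance) pair; here the
    per-token feature variance stands in for a learned uncertainty head."""
    mu = tokens
    var = np.var(tokens, axis=1, keepdims=True) + 1e-6
    return mu, var

def compose_query(video_tokens, text_tokens):
    """Compose a single query feature, down-weighting high-uncertainty
    (ambiguous) atomic tokens."""
    grounded = holistic_disambiguation(video_tokens, text_tokens)
    mu, var = atomistic_uncertainty(grounded)
    weights = softmax(-np.log(var).squeeze(-1))  # certain tokens dominate
    return (weights[:, None] * mu).sum(axis=0)

# toy data: 8 video tokens, 5 text tokens, feature dim 16
video = rng.normal(size=(8, 16))
text = rng.normal(size=(5, 16))
q = compose_query(video, text)
print(q.shape)  # (16,)
```

The composed feature `q` would then be matched against candidate target-video embeddings (e.g. by cosine similarity); the holistic-to-atomistic alignment objective in the paper supervises both the coarse fused representation and the fine-grained token weighting, which this sketch only gestures at.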
Problem

Research questions and friction points this paper is trying to address.

Addresses modification subject ambiguity in composed video retrieval
Resolves limited semantic focus in multimodal query understanding
Mitigates information density disparity between video and text modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical uncertainty-aware disambiguation network for video retrieval
Exploits video-text information density disparity for query understanding
Uses holistic and atomistic cross-modal interactions for feature learning
👥 Authors
Zhiwei Chen
School of Software, Shandong University, Jinan, China
Yupeng Hu
School of Software, Shandong University, Jinan, China
Zixu Li
School of Software, Shandong University, Jinan, China
Zhiheng Fu
School of Software, Shandong University, Jinan, China
Haokun Wen
Harbin Institute of Technology, Shenzhen
Weili Guan
School of Information Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China