BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance

📅 2025-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In text-video retrieval, visual-linguistic biases in the training data cause models to overlook fine-grained semantic details. To address this, we propose BiMa, a scene-element-guided framework that debiases both the visual and the textual side. First, it generates scene elements for each video and embeds them into the video representation, explicitly incorporating scene-structure priors into cross-modal alignment. Second, it introduces a text feature disentanglement module that separates content-relevant components from bias-correlated ones. The work is presented as the first to explicitly integrate structured scene priors into the alignment process, coupled with bidirectional contrastive learning for joint optimization. Evaluated on five major benchmarks (MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo), the framework achieves competitive performance. It also shows strong results on out-of-distribution retrieval tasks, supporting the claim that debiasing improves the model’s semantic understanding.

📝 Abstract

Text-video retrieval (TVR) systems often suffer from visual-linguistic biases present in datasets, which cause pre-trained vision-language models to overlook key details. To address this, we propose BiMa, a novel framework designed to mitigate biases in both visual and textual representations. Our approach begins by generating scene elements that characterize each video by identifying relevant entities/objects and activities. For visual debiasing, we integrate these scene elements into the video embeddings, enhancing them to emphasize fine-grained and salient details. For textual debiasing, we introduce a mechanism to disentangle text features into content and bias components, enabling the model to focus on meaningful content while separately handling biased information. Extensive experiments and ablation studies across five major TVR benchmarks (i.e., MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo) demonstrate the competitive performance of BiMa. Additionally, the model's bias mitigation capability is consistently validated by its strong results on out-of-distribution retrieval tasks.
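
The visual debiasing step described above (integrating scene elements into the video embedding) can be illustrated with a minimal sketch. The abstract does not specify the fusion mechanism, so everything here is an assumption: the attention-style weighting, the function names `fuse_scene_elements` and `rank_videos`, and the parameters `alpha` and `temperature` are hypothetical and are not taken from BiMa.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_scene_elements(video_emb, element_embs, alpha=0.5, temperature=0.1):
    """Hypothetical fusion (an assumption, not the paper's module):
    attend over scene-element embeddings using the video embedding as the
    query, then blend the attended summary into the video embedding."""
    v = l2_normalize(video_emb)
    elems = [l2_normalize(e) for e in element_embs]
    # Elements most similar to the video receive the largest weights.
    weights = softmax([dot(v, e) / temperature for e in elems])
    summary = [sum(w * e[i] for w, e in zip(weights, elems))
               for i in range(len(v))]
    fused = [(1 - alpha) * vi + alpha * si for vi, si in zip(v, summary)]
    return l2_normalize(fused), weights

def rank_videos(text_emb, video_embs):
    """Return video indices sorted by cosine similarity to the text query."""
    t = l2_normalize(text_emb)
    sims = [dot(t, l2_normalize(v)) for v in video_embs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
```

In this sketch, `fuse_scene_elements` would be applied to each video before calling `rank_videos`, so that retrieval operates on the enhanced embeddings. A learned cross-attention layer would be the more likely choice in a trained model; the fixed softmax weighting is used here only to keep the example self-contained.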

Problem

Research questions and friction points this paper is trying to address.

Mitigating visual-linguistic biases in text-video retrieval systems

Enhancing video embeddings with scene elements for fine-grained details

Disentangling text features to separate content from biased information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates scene elements for video characterization

Integrates scene elements into video embeddings

Disentangles text features into content and bias
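
The third item, splitting a text feature into content and bias components, can be sketched as follows. The source does not describe how BiMa performs the disentanglement, so this example swaps in a simple stand-in technique: orthogonal projection onto an assumed bias subspace. The function `disentangle` and the `bias_directions` argument are hypothetical; a trained model would presumably learn this separation rather than use fixed directions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def disentangle(text_emb, bias_directions):
    """Split text_emb into (content, bias) such that content + bias == text_emb.

    Stand-in technique (an assumption, not the paper's module): the bias
    component is the projection of text_emb onto the subspace spanned by
    bias_directions; the content component is the orthogonal remainder.
    """
    # Orthonormalize the assumed bias directions with Gram-Schmidt.
    basis = []
    for d in bias_directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > 1e-12:  # skip directions already covered by the basis
            basis.append([ri / norm for ri in r])

    # Project the text embedding onto the bias subspace.
    bias = [0.0] * len(text_emb)
    for b in basis:
        c = dot(text_emb, b)
        bias = [x + c * bi for x, bi in zip(bias, b)]

    content = [t - x for t, x in zip(text_emb, bias)]
    return content, bias

content, bias = disentangle([3.0, 4.0, 5.0], [[0.0, 0.0, 2.0]])
# content is orthogonal to bias, and the two parts sum to the input
```

Under this sketch, only the `content` vector would be matched against video embeddings, while `bias` is kept separate, which mirrors the stated goal of focusing on meaningful content while handling biased information separately.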

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs

2024-06-26 | Citations: 4

Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach

2024-08-14 | Citations: 0

SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval

2024-04-22 | IEEE Transactions on Circuits and Systems for Video Technology (Print) | Citations: 0
