🤖 AI Summary
Addressing the inherent challenges of multi-event understanding and hand-object interaction recognition in first-person video question answering, this paper proposes a dual-modal counterfactual contrastive learning framework. To enhance causal reasoning, it introduces, for the first time, a counterfactual sample construction mechanism: leveraging event paraphrasing and core interaction mining, it generates positive and negative counterfactual samples across both textual and visual modalities, and employs a contrastive loss to strengthen the model's ability to discriminate causal associations. Built upon a pretraining-finetuning paradigm, the framework integrates event paraphrasing, interaction mining, and dual-modal counterfactual modeling. Experiments show state-of-the-art results: 52.51% and 46.04% accuracy on the EgoTaskQA normal and indirect subsets, respectively, and 13.2% on QAEGO4D, surpassing existing methods.
📝 Abstract
Egocentric Video Question Answering (Egocentric VideoQA), which refers to answering questions based on first-person videos, plays an important role in egocentric video understanding. Although existing methods have made progress through the pre-training and fine-tuning paradigm, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC$^3$) framework, which contains an Egocentric VideoQA baseline, a counterfactual sample construction module, and a counterfactual sample-involved contrastive optimization. Specifically, we first develop a counterfactual sample construction module to generate positive and negative samples for the textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, we feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply a contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieves 52.51% and 46.04% on the *normal* and *indirect* splits of EgoTaskQA, and 13.2% on QAEGO4D, both reaching state-of-the-art performance.
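The contrastive objective described above (pulling original sample features toward positive counterfactual features while pushing them away from negatives) can be sketched as an InfoNCE-style loss. This is a minimal illustration under assumed tensor shapes, not the paper's exact formulation; the function name and temperature value are hypothetical.

```python
import numpy as np

def counterfactual_contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style sketch of a counterfactual contrastive loss.

    anchor:    (B, D) original sample features
    positive:  (B, D) positive counterfactual features
    negatives: (B, K, D) K negative counterfactual features per sample
    Minimizing this pulls anchor toward positive and away from negatives.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negatives)

    pos_sim = np.sum(a * p, axis=-1, keepdims=True)        # (B, 1) cosine sims
    neg_sim = np.einsum("bd,bkd->bk", a, n)                # (B, K) cosine sims
    logits = np.concatenate([pos_sim, neg_sim], axis=1) / temperature

    # Cross-entropy with the positive at index 0 (log-sum-exp for stability).
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()
```

In this form, the loss decreases as the anchor-positive similarity grows relative to the anchor-negative similarities, matching the minimize/maximize behavior described in the abstract.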