Two Causal Principles for Improving Visual Dialog

📅 2019-11-24
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 158
Influential: 7
🤖 AI Summary
Existing VisDial methods overlook two critical causal structures: (1) a shortcut bias induced by feeding the dialog history directly into the answer model, and (2) an unobserved confounder linking history, question, and answer, which introduces spurious correlations. Method: This work is the first to systematically identify such causal misspecification in VisDial and proposes two general causal principles: (i) remove the direct path from history to the answer model, and (ii) explicitly model and intervene on the latent confounder. Based on these principles, we design model-agnostic causal intervention algorithms that depart from conventional likelihood estimation by combining causal intervention with explicit confounder control. Contribution/Results: Our method establishes new state-of-the-art results on standard VisDial benchmarks, consistently improves all major baseline models, and enabled our team to win the Visual Dialog Challenge 2019.
📝 Abstract
This paper unravels the design tricks adopted by us, the champion team MReaL-BDAI, for Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial). By "improving", we mean that they can promote almost every existing VisDial model to the state-of-the-art performance on the leader-board. Such a major improvement is only due to our careful inspection on the causality behind the model and data, finding that the community has overlooked two causalities in VisDial. Intuitively, Principle 1 suggests: we should remove the direct input of the dialog history to the answer model, otherwise a harmful shortcut bias will be introduced; Principle 2 says: there is an unobserved confounder for history, question, and answer, leading to spurious correlations from training data. In particular, to remove the confounder suggested in Principle 2, we propose several causal intervention algorithms, which make the training fundamentally different from the traditional likelihood estimation. Note that the two principles are model-agnostic, so they are applicable in any VisDial model.
Problem

Research questions and friction points this paper is trying to address.

Identifies overlooked causal relationships in visual dialog models
Proposes model-agnostic principles to remove harmful biases and confounders
Introduces causal intervention algorithms to improve training and performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Removing dialog history input prevents shortcut bias
Causal intervention algorithms address unobserved confounder
Model-agnostic principles enhance existing VisDial models
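The intervention suggested by Principle 2 can be read as a backdoor-style adjustment from causal inference. A hedged sketch of the idea (here $U$ is the unobserved confounder of history $H$, question $Q$, and answer $A$; the paper's actual estimators and the way $U$ is approximated may differ):

```latex
% Conventional likelihood: the confounder U still depends on the inputs,
% so spurious correlations from the training data leak into the answer.
P(A \mid H, Q) \;=\; \sum_{u} P(A \mid H, Q, u)\, P(u \mid H, Q)

% Backdoor-style intervention: cut the dependence of U on (H, Q),
% so the model is trained on the interventional distribution instead.
P\bigl(A \mid \operatorname{do}(H, Q)\bigr) \;=\; \sum_{u} P(A \mid H, Q, u)\, P(u)
```

Replacing the conditional prior $P(u \mid H, Q)$ with the marginal $P(u)$ is what makes this training objective fundamentally different from traditional likelihood estimation, as the abstract notes.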