🤖 AI Summary
Existing VisDial methods overlook two causal structures: (1) a shortcut bias introduced by feeding the dialog history directly into the answer model, and (2) an unobserved confounder of history, question, and answer that produces spurious correlations in the training data. Method: This work identifies these two overlooked causal relations and proposes two model-agnostic causal principles: (i) remove the direct path from the history to the answer model, and (ii) account for the unobserved confounder through causal intervention. For the second principle, we propose several causal intervention algorithms whose training differs fundamentally from conventional likelihood estimation. Contribution/Results: Applied to existing VisDial models, the two principles lift almost every one of them to state-of-the-art performance on the leaderboard, and they enabled our team, MReaL-BDAI, to win the Visual Dialog Challenge 2019.
📝 Abstract
This paper reveals the design tricks that we, the champion team MReaL-BDAI, adopted for the Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial). By "improving", we mean that they can promote almost every existing VisDial model to state-of-the-art performance on the leaderboard. This improvement comes entirely from a careful inspection of the causality behind the model and the data, which shows that the community has overlooked two causal relations in VisDial. Intuitively, Principle 1 suggests that we should remove the direct input of the dialog history to the answer model; otherwise, a harmful shortcut bias is introduced. Principle 2 says that there is an unobserved confounder for history, question, and answer, which leads to spurious correlations learned from the training data. In particular, to remove the confounder identified in Principle 2, we propose several causal intervention algorithms, which make training fundamentally different from traditional likelihood estimation. Note that the two principles are model-agnostic, so they are applicable to any VisDial model.
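To make the contrast between likelihood estimation and causal intervention concrete, the following toy sketch compares a conditional probability with the standard backdoor-adjusted quantity on a small discrete distribution. It is an illustration only, not the algorithms proposed in the paper: the variable names (`x` for an input such as the question/history, `y` for the answer, `u` for the confounder) and all probabilities are made-up assumptions, and the confounder is treated as enumerable here, whereas in VisDial it is unobserved.

```python
# Toy illustration (hypothetical numbers): conditioning vs. intervening
# when a confounder u influences both the input x and the answer y.

P_U = {0: 0.5, 1: 0.5}                     # P(u)
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}            # P(x=1 | u)
P_Y1_GIVEN_XU = {                          # P(y=1 | x, u)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.3, (1, 1): 0.7,
}


def p_x_given_u(x, u):
    """P(x | u) for binary x."""
    return P_X1_GIVEN_U[u] if x == 1 else 1.0 - P_X1_GIVEN_U[u]


def observational(x):
    """P(y=1 | x) = sum_u P(y=1 | x, u) P(u | x): what likelihood fitting sees."""
    joint = {u: P_U[u] * p_x_given_u(x, u) for u in P_U}
    z = sum(joint.values())
    return sum(P_Y1_GIVEN_XU[(x, u)] * joint[u] / z for u in P_U)


def interventional(x):
    """P(y=1 | do(x)) = sum_u P(y=1 | x, u) P(u): backdoor adjustment."""
    return sum(P_Y1_GIVEN_XU[(x, u)] * P_U[u] for u in P_U)


if __name__ == "__main__":
    obs_gap = observational(1) - observational(0)
    int_gap = interventional(1) - interventional(0)
    print(f"observational gap:  {obs_gap:.2f}")
    print(f"interventional gap: {int_gap:.2f}")
```

With these made-up numbers, the observational gap between `x=1` and `x=0` is about 0.44, while the interventional gap is about 0.20, which matches the effect of `x` within each value of `u`. The extra 0.24 is the spurious correlation carried by the confounder, the kind of bias that Principle 2 targets.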