🤖 AI Summary
In recipe–image cross-modal retrieval, visual–textual representation discrepancies arise from cooking procedures, food presentation styles, and imaging conditions. To address this, we propose a causal inference–based debiased representation learning framework. Our method treats ingredients as confounders and applies backdoor adjustment to eliminate their spurious influence on image–text similarity estimation; introduces a plug-and-play multi-label ingredient classification module to explicitly model fine-grained semantic correspondences; and incorporates a causal intervention term into the contrastive learning objective to achieve asymmetric cross-modal alignment. Evaluated on Recipe1M, our approach achieves MedR = 1 (a median retrieval rank of 1, meaning the correct item is the top result for at least half of all queries), setting a new state of the art. It further demonstrates strong robustness and generalization across retrieval scales ranging from 1K to 50K samples.
📝 Abstract
This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. Because the relationship between a recipe and its cooked dish is one of cause and effect, treating a recipe as a text source describing the visual appearance of a dish for representation learning, as existing approaches do, introduces bias that misleads image-and-recipe similarity judgment. Specifically, a food image may not capture every detail of a recipe equally, due to factors such as the cooking process, dish presentation, and image-capturing conditions. Current representation learning tends to capture dominant visual–text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests ingredients as one source of confounding, and a simple backdoor adjustment can alleviate the bias. Through causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term that removes the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically show that the oracle retrieval performance on the Recipe1M dataset reaches MedR=1 across testing data sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module, essentially a multi-label ingredient classifier, for debiasing. New state-of-the-art search performance is reported on the Recipe1M dataset.
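To make the backdoor-adjustment idea concrete, here is a minimal NumPy sketch of how adjusting image–recipe similarity over a discrete ingredient confounder might look. This is an illustrative toy, not the paper's actual model: the function names, the uniform ingredient prior, and the choice of conditioning on an ingredient by shifting the recipe embedding toward that ingredient's embedding are all hypothetical assumptions.

```python
import numpy as np

def backdoor_adjusted_similarity(img_emb, rec_emb, ing_prior, ing_embs):
    """Toy backdoor adjustment over a discrete ingredient confounder z.

    Biased similarity:   s(v, r)     = v . r
    Adjusted similarity: s_adj(v, r) = sum_z P(z) * s(v, r | z)

    Here conditioning on ingredient z is (hypothetically) approximated by
    shifting the recipe embedding toward that ingredient's embedding.
    """
    biased = img_emb @ rec_emb
    # Similarity conditioned on each ingredient z.
    conditional = np.array([img_emb @ (rec_emb + e) for e in ing_embs])
    # Expectation under the ingredient prior P(z), per the backdoor formula.
    adjusted = ing_prior @ conditional
    return biased, adjusted

rng = np.random.default_rng(0)
d, k = 8, 5                               # embedding dim, number of ingredients
v = rng.normal(size=d)                    # image embedding
r = rng.normal(size=d)                    # recipe embedding
P_z = np.full(k, 1.0 / k)                 # uniform ingredient prior (assumption)
E_z = rng.normal(size=(k, d))             # per-ingredient embeddings
biased, debiased = backdoor_adjusted_similarity(v, r, P_z, E_z)
```

With this linear conditioning, the adjustment decomposes as the biased score plus a prior-weighted image–ingredient term, which matches the abstract's description of "an additional term to remove the potential bias in similarity judgment."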