🤖 AI Summary
In recipe–image cross-modal retrieval, visual–textual representation discrepancies arise from cooking procedures, food presentation styles, and imaging conditions. To address this, we propose a causal inference–based debiased representation learning framework. Our method treats ingredients as confounders and applies backdoor adjustment to eliminate their spurious influence on image–text similarity estimation; introduces a plug-and-play multi-label ingredient classification module to explicitly model fine-grained semantic correspondences; and incorporates a causal intervention term into the contrastive learning objective to achieve asymmetric cross-modal alignment. Evaluated on Recipe1M, our approach achieves MedR = 1 (a median retrieval rank of 1, meaning the correct item is the top result for at least half of all queries), setting a new state of the art. It further demonstrates strong robustness and generalization across retrieval scales ranging from 1K to 50K samples.
📝 Abstract
This paper addresses the challenges of learning representations for recipes and food images in the cross-modal retrieval problem. Because the relationship between a recipe and its cooked dish is one of cause and effect, treating a recipe as a text source describing the visual appearance of a dish for representation learning, as existing approaches do, introduces bias that misleads image-and-recipe similarity judgment. Specifically, a food image may not capture every detail of a recipe equally, due to factors such as the cooking process, dish presentation, and image-capturing conditions. Current representation learning tends to capture dominant visual–text alignment while overlooking subtle variations that determine retrieval relevance. In this paper, we model such bias in cross-modal representation learning using causal theory. The causal view of this problem suggests ingredients as one source of confounding, and a simple backdoor adjustment can alleviate the bias. Through causal intervention, we reformulate the conventional model for food-to-recipe retrieval with an additional term that removes the potential bias in similarity judgment. Based on this theory-informed formulation, we empirically show that the oracle retrieval performance on the Recipe1M dataset reaches MedR=1 across testing data sizes of 1K, 10K, and even 50K. We also propose a plug-and-play neural module, essentially a multi-label ingredient classifier, for debiasing. New state-of-the-art search performance is reported on the Recipe1M dataset.
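To make the backdoor-adjustment idea concrete, here is a minimal NumPy sketch of how adjusting image–recipe similarity over a discrete ingredient confounder might look. This is an illustrative toy, not the paper's actual model: the function names, the uniform ingredient prior, and the choice of conditioning on an ingredient by shifting the recipe embedding toward that ingredient's embedding are all hypothetical assumptions.

```python
import numpy as np

def backdoor_adjusted_similarity(img_emb, rec_emb, ing_prior, ing_embs):
    """Toy backdoor adjustment over a discrete ingredient confounder z.

    Biased similarity:   s(v, r)     = v . r
    Adjusted similarity: s_adj(v, r) = sum_z P(z) * s(v, r | z)

    Here conditioning on ingredient z is (hypothetically) approximated by
    shifting the recipe embedding toward that ingredient's embedding.
    """
    biased = img_emb @ rec_emb
    # Similarity conditioned on each ingredient z.
    conditional = np.array([img_emb @ (rec_emb + e) for e in ing_embs])
    # Expectation under the ingredient prior P(z), per the backdoor formula.
    adjusted = ing_prior @ conditional
    return biased, adjusted

rng = np.random.default_rng(0)
d, k = 8, 5                               # embedding dim, number of ingredients
v = rng.normal(size=d)                    # image embedding
r = rng.normal(size=d)                    # recipe embedding
P_z = np.full(k, 1.0 / k)                 # uniform ingredient prior (assumption)
E_z = rng.normal(size=(k, d))             # per-ingredient embeddings
biased, debiased = backdoor_adjusted_similarity(v, r, P_z, E_z)
```

With this linear conditioning, the adjustment decomposes as the biased score plus a prior-weighted image–ingredient term, which matches the abstract's description of "an additional term to remove the potential bias in similarity judgment."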