Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image–recipe retrieval methods assume that food images fully encode recipe content; however, such images only depict the final dish’s appearance and fail to convey fine-grained, visually imperceptible cooking factors—e.g., knife skills, simmering duration, or spice combinations—leading to cross-modal representation bias, especially in multi-cuisine datasets. This work introduces, for the first time, a causal inference framework to explicitly model latent cooking variables omitted in images. We propose a unified architecture integrating causal representation learning, fine-grained cooking element prediction, and cross-modal alignment, thereby injecting implicit culinary knowledge into multimodal joint representations. Our method significantly enhances fine-grained recipe discrimination on both Recipe1M and a newly constructed multilingual, multicultural benchmark. It achieves substantial improvements in retrieval accuracy over current state-of-the-art approaches.

📝 Abstract
Existing approaches for image-to-recipe retrieval carry the implicit assumption that a food image can fully capture the details textually documented in its recipe. However, a food image only reflects the visual outcome of a cooked dish, not the underlying cooking process. Consequently, cross-modal representations learned to bridge the modality gap between images and recipes tend to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased toward the dominant visual elements, making it difficult to rank similar recipes with subtle differences in the use of ingredients and cooking methods. This representation bias is expected to be more severe when the training data mixes images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images, while explicitly injecting these elements into cross-modal representation learning to mitigate the bias. Experiments are conducted on the standard monolingual Recipe1M dataset and a newly curated multilingual, multicultural cuisine dataset. The results indicate that the proposed causal representation learning is capable of uncovering subtle ingredients and cooking actions and achieves impressive retrieval performance on both monolingual and multilingual multicultural datasets.
Problem

Research questions and friction points this paper is trying to address.

Mitigating cross-modal bias in image-recipe retrieval across multicultural cuisines
Addressing overlooked culinary elements like ingredients and cooking methods
Improving retrieval of visually similar recipes with subtle textual differences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts overlooked culinary elements in images
Injects culinary elements into cross-modal learning
Uses causal representation learning for bias mitigation
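The core idea above (predict latent culinary elements from the image, then inject them into the shared image-recipe embedding before retrieval) can be sketched in a toy form. This is a minimal illustration with random stand-in embeddings, not the paper's actual architecture; the element predictor, projection matrix, and dimensions here are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Unit-normalize embeddings so dot product = cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy setup: 4 image/recipe pairs embedded in a shared 8-d space.
# In the real system these would come from trained encoders.
img = l2_normalize(rng.normal(size=(4, 8)))
rec = l2_normalize(rng.normal(size=(4, 8)))

# Hypothetical culinary-element predictor: scores over K latent
# elements (e.g. cooking actions, subtle ingredients) predicted from
# the image, projected into the shared space and added to the image
# embedding before alignment. Both matrices are random stand-ins.
K = 5
elem_scores = rng.random(size=(4, K))   # stand-in element predictions
proj = rng.normal(size=(K, 8)) * 0.1    # stand-in projection to shared space
img_aug = l2_normalize(img + elem_scores @ proj)

def retrieval_rank(query, gallery, true_idx):
    """Rank of the ground-truth recipe for one image query (0 = best)."""
    sims = gallery @ query
    order = np.argsort(-sims)
    return int(np.where(order == true_idx)[0][0])

ranks = [retrieval_rank(img_aug[i], rec, i) for i in range(len(img_aug))]
print(ranks)
```

With trained encoders, the augmented image embedding would be optimized jointly with a cross-modal alignment loss (e.g. a contrastive objective over image-recipe pairs), so that the injected element information helps separate visually similar dishes with different recipes.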