🤖 AI Summary
Current visual question answering (VQA) models exhibit a strong Western dietary bias and struggle to comprehend the diversity and intricate culinary logic of Indian food culture, particularly in multi-step relational reasoning and contextual modeling tasks. To address this, we propose the first reasoning-chain-enhanced VQA framework tailored for Indian cuisine: it integrates a small language model with a vision-language model to generate verifiable, multi-step reasoning chains; employs reinforcement learning to optimize reasoning paths; and incorporates a food knowledge graph to enable context-aware and cross-entity relational inference. This work is the first to systematically apply multi-step reasoning mechanisms to food-domain VQA, significantly improving semantic parsing depth and interpretability for non-Western dietary cultures. Evaluated on an Indian food VQA benchmark, our method achieves an average accuracy gain of 10 percentage points over the baseline, demonstrating both the efficacy of the reasoning-chain design and its cross-cultural generalization capability.
📝 Abstract
The immense cultural and culinary diversity of Indian cuisine exposes a major shortcoming of existing Visual Question Answering (VQA) systems, which are biased towards foods from Western regions. A recent attempt to build a VQA dataset for Indian food is a step towards addressing this challenge. However, that approach to VQA follows a two-step process in which the answer is generated first, followed by an explanation of the expected answer. In this work, we claim that food VQA requires a multi-step reasoning process to arrive at an accurate answer, especially in the context of Indian food, which involves understanding complex culinary context and identifying relationships between various food items. Based on this hypothesis, we create reasoning chains on top of the QA pairs with minimal human intervention. We fine-tune smaller LLMs and VLMs on auto-validated reasoning chains and further train them using reinforcement learning on a larger dataset. With the addition of reasoning chains, we observe an average accuracy improvement of 10 percentage points over the baseline. We provide a detailed analysis of the effect of adding reasoning chains for the Indian food VQA task.
Index Terms - FoodVQA, Reasoning Chains, Reinforcement Learning, Knowledge Graph.