🤖 AI Summary
This work addresses the challenge of capturing the fine-grained, multimodal cultural variations in biryani preparation across India, specifically in cooking steps, ingredients, and presentation, which existing video understanding methods struggle to model. To this end, we introduce the first large-scale, regionally annotated dataset of biryani cooking videos, comprising 120 high-quality videos across 12 regional styles. We propose a multi-stage framework that leverages vision-language models for fine-grained video segmentation and aligns the resulting segments cross-modally with audio transcripts and canonical recipes. We further develop a human-feedback-informed contrastive analysis pipeline and a question-answering benchmark spanning multiple reasoning levels. Experiments show that our approach effectively discerns regional culinary distinctions, achieving strong performance in both zero-shot and fine-tuned settings. The released dataset, code, and benchmark establish a new platform for computational study of multimodal cultural practices.
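The summary above does not spell out how segmentation and cross-modal alignment are implemented; as a minimal sketch of what such a stage might look like (the `Segment` schema, the greedy matching, and the toy word-overlap similarity are all illustrative assumptions, not the paper's actual method):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One fine-grained procedural unit extracted from a cooking video."""
    start_s: float          # segment start time, in seconds
    end_s: float            # segment end time, in seconds
    step_label: str         # VLM-produced step description, e.g. "layer parboiled rice"
    transcript: str = ""    # best-matching audio-transcript span (filled by alignment)
    recipe_step: str = ""   # best-matching canonical-recipe step (filled by alignment)

def word_overlap(a: str, b: str) -> float:
    """Toy text similarity: Jaccard overlap of lowercased word sets.
    A real pipeline would likely use embedding cosine similarity instead."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def align_segments(segments, transcript_spans, recipe_steps, sim=word_overlap):
    """Greedy cross-modal alignment: attach to each visual segment the
    transcript span and recipe step it is most similar to."""
    for seg in segments:
        seg.transcript = max(transcript_spans, key=lambda t: sim(seg.step_label, t))
        seg.recipe_step = max(recipe_steps, key=lambda r: sim(seg.step_label, r))
    return segments
```

Greedy one-best matching is only one plausible design choice; an alternative would be globally optimal alignment (e.g. dynamic time warping over the step sequences), which respects the temporal order of recipe steps.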
📝 Abstract
Biryani, one of India's most celebrated dishes, exhibits remarkable regional diversity in its preparation, ingredients, and presentation. With the growing availability of online cooking videos, there is unprecedented potential to systematically study such culinary variations with computational tools. However, existing video understanding methods fail to capture the fine-grained, multimodal, and culturally grounded differences in procedural cooking videos. This work presents the first large-scale, curated dataset of biryani preparation videos, comprising 120 high-quality YouTube recordings across 12 distinct regional styles. We propose a multi-stage framework leveraging recent advances in vision-language models (VLMs) to segment videos into fine-grained procedural units and align them with audio transcripts and canonical recipe text. Building on these aligned representations, we introduce a video comparison pipeline that automatically identifies and explains procedural differences between regional variants. We construct a comprehensive question-answer (QA) benchmark spanning multiple reasoning levels to evaluate procedural understanding in VLMs. Our approach employs multiple VLMs in complementary roles, incorporates human-in-the-loop verification for high-precision tasks, and benchmarks several state-of-the-art models under zero-shot and fine-tuned settings. The resulting dataset, comparison methodology, and QA benchmark provide a new testbed for evaluating VLMs on structured, multimodal reasoning tasks and open new directions for computational analysis of cultural heritage through cooking videos. We release all data and code on the project website at https://farzanashaju.github.io/how-does-india-cook-biryani/.
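For the benchmarking described above, per-reasoning-level accuracy is the natural summary metric; a rough sketch of such an evaluation loop follows, assuming multiple-choice QA items (the record fields and the `model.answer(...)` interface are hypothetical placeholders for whatever inference call a given VLM exposes, not the paper's actual schema):

```python
from collections import defaultdict

def evaluate_zero_shot(model, benchmark):
    """Compute per-reasoning-level accuracy of a VLM on multiple-choice QA.

    `benchmark` is assumed to be a list of dicts with keys
    "video", "question", "options", "answer", and "reasoning_level".
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in benchmark:
        # Hypothetical inference call; each VLM would be wrapped to match it.
        pred = model.answer(item["video"], item["question"], item["options"])
        level = item["reasoning_level"]
        total[level] += 1
        correct[level] += int(pred == item["answer"])
    return {level: correct[level] / total[level] for level in total}
```

Reporting accuracy per reasoning level, rather than a single aggregate number, makes it visible where models succeed on surface perception but fail on deeper procedural or cross-regional comparison questions.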