FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation

📅 2024-09-27
🏛️ Conference on Multimedia Modeling
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the task of Japanese recipe generation, a culturally grounded, low-resource multimodal generation problem. We propose the first end-to-end multimodal large language model (MLLM) framework tailored to Japanese culinary culture. Methodologically, we extend the LLaVA architecture by integrating a Japanese-optimized LLaMA language model with a ResNet-50 visual encoder, and introduce two domain-specific innovations: (i) cuisine-aware visual prompt tuning and (ii) dish-name–step alignment constraints. We additionally employ culinary-knowledge-enhanced instruction fine-tuning. On our newly constructed JP-RecipeBench benchmark, the approach improves BLEU-4 by 12.6 points over baselines, and human evaluation confirms that 87% of generated recipes are both operationally feasible and culturally appropriate. To our knowledge, this is the first system capable of generating structured, idiomatic Japanese cooking instructions directly from food images. The work establishes a reusable technical paradigm and publicly releases curated data resources for multilingual, domain-specific multimodal generation.

Problem

Research questions and friction points this paper is trying to address.

Improving Japanese recipe generation using Multimodal Large Language Models.
Enhancing food image understanding through recipe data in Japanese.
Benchmarking open MLLMs against GPT-4o for recipe accuracy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned LLaVA-1.5 and Phi-3 Vision models
Benchmarked against GPT-4o for recipe generation
Achieved higher F1 score in ingredient generation
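The F1 comparison above is typically computed as set-based precision and recall over predicted versus reference ingredient lists. A minimal sketch of that metric (the function name and the example ingredients are illustrative, not taken from the paper):

```python
def ingredient_f1(predicted, gold):
    """Set-based F1 between predicted and reference ingredient lists."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)  # ingredients that appear in both lists
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: 2 of 3 predictions match the reference -> precision = recall = 2/3
score = ingredient_f1(["soy sauce", "tofu", "mirin"],
                      ["soy sauce", "tofu", "dashi"])
```

In practice, ingredient strings are usually normalized (lowercasing, unit stripping) before matching, so exact scores depend on that preprocessing.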