🤖 AI Summary
This work addresses Japanese recipe generation, a culturally grounded, low-resource multimodal generation problem. We propose the first end-to-end multimodal large language model (MLLM) framework tailored to Japanese culinary culture. Methodologically, we extend the LLaVA architecture by pairing a Japanese-optimized LLaMA language model with a ResNet-50 visual encoder, and introduce two domain-specific innovations: (i) cuisine-aware visual prompt tuning and (ii) dish-name–step alignment constraints. We further apply culinary knowledge–enhanced instruction fine-tuning. On our newly constructed JP-RecipeBench benchmark, the approach improves BLEU-4 by 12.6 points over baselines, and human evaluation confirms that 87% of generated recipes are both operationally feasible and culturally appropriate. To our knowledge, this is the first system that generates structured, idiomatic Japanese cooking instructions directly from food images. The work establishes a reusable technical paradigm and publicly releases curated data resources for multilingual, domain-specific multimodal generation.
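The LLaVA-style wiring described above (a frozen visual encoder whose features are projected into the language model's embedding space and prepended as "visual tokens") can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, the linear projector, and the random stand-in tensors are all assumptions for demonstration.

```python
import numpy as np

# Hypothetical dimensions (not taken from the paper): ResNet-50's final
# 7x7 feature map gives 49 patch vectors of 2048 dims; assume an LLM
# hidden size of 4096.
N_PATCHES, VIS_DIM, LLM_DIM = 49, 2048, 4096

rng = np.random.default_rng(0)

def project_visual_tokens(patch_feats, W, b):
    """Map visual patch features into the LLM embedding space,
    in the style of LLaVA's learned linear projector."""
    return patch_feats @ W + b

# Stand-ins for a ResNet-50 feature map and a trained projector.
patch_feats = rng.normal(size=(N_PATCHES, VIS_DIM))
W = rng.normal(scale=0.01, size=(VIS_DIM, LLM_DIM))
b = np.zeros(LLM_DIM)

visual_tokens = project_visual_tokens(patch_feats, W, b)

# Prepend visual tokens to the embedded text prompt before the LLM
# consumes the combined sequence (here, a 16-token dummy prompt).
text_embeds = rng.normal(size=(16, LLM_DIM))
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)  # (65, 4096): 49 visual + 16 text tokens
```

In the actual framework, the projector and LLM would be trained jointly during instruction fine-tuning; this sketch only shows the data flow from image features to the language model's input sequence.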