🤖 AI Summary
Existing multimodal recipe datasets lack fine-grained alignment among recipe goals, step-wise instructions, and visual content, hindering progress in culinary education and multimodal cooking assistants. To address this, we introduce RecipeGen, the first multimodal recipe generation benchmark grounded in real-world cooking scenarios, comprising over 26,000 recipes, more than 196,000 images, and 4,491 videos with step-level cross-modal alignment. We propose domain-specific evaluation metrics focused on ingredient fidelity and procedural interaction modeling, and establish a unified benchmarking framework covering three generative tasks: text-to-image, image-to-video, and text-to-video. A comprehensive evaluation of state-of-the-art models reveals critical bottlenecks in ingredient consistency and temporal action modeling. RecipeGen provides reproducible, scalable multimodal research infrastructure for food computing.
📝 Abstract
Creating recipe images is a key challenge in food computing, with applications in culinary education and multimodal recipe assistants. However, existing datasets lack fine-grained alignment among recipe goals, step-wise instructions, and visual content. We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,453 recipes, 196,724 images, and 4,491 videos, covering diverse ingredients, cooking procedures, styles, and dish types. We further propose domain-specific evaluation metrics to assess ingredient fidelity and interaction modeling, benchmark representative T2I, I2V, and T2V models, and provide insights for future recipe generation models. The project page is now available.
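The abstract does not spell out how the ingredient-fidelity metric is computed. As a rough illustration only, the sketch below scores a generated step image against its ingredient list with an off-the-shelf CLIP model; the model choice, prompt template, and mean-similarity aggregation are assumptions for illustration, not the paper's actual metric.

```python
# Illustrative sketch, NOT the RecipeGen metric: approximate a step-level
# "ingredient fidelity" score by comparing a generated step image against
# prompts naming each expected ingredient, using off-the-shelf CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ingredient_fidelity(image_path: str, ingredients: list[str]) -> float:
    """Mean CLIP cosine similarity between one step image and its ingredient prompts."""
    image = Image.open(image_path).convert("RGB")
    prompts = [f"a photo of {ing}" for ing in ingredients]  # prompt template is an assumption
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize embeddings and average the image-to-ingredient similarities.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Hypothetical usage for a step that should show eggs and flour:
# print(ingredient_fidelity("step_03.png", ["eggs", "flour"]))
```

A per-step score like this could then be averaged over a recipe or over the whole benchmark to compare T2I, I2V, and T2V models on ingredient consistency.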