🤖 AI Summary
Existing multimodal recipe datasets lack fine-grained alignment among recipe goals, step-wise instructions, and visual content, hindering progress in culinary education and multimodal cooking assistants. To address this, we introduce RecipeGen, the first multimodal recipe generation benchmark grounded in real-world cooking scenarios, comprising over 26,000 recipes, more than 196,000 images, and 4,491 videos with step-level cross-modal alignment. We propose domain-specific evaluation metrics focused on ingredient fidelity and procedural interaction modeling, and establish a unified benchmarking framework covering three generative tasks: text-to-image, image-to-video, and text-to-video. A comprehensive evaluation of state-of-the-art models reveals critical bottlenecks in ingredient consistency and temporal action modeling. RecipeGen provides reproducible, scalable multimodal research infrastructure for food computing.
📝 Abstract
Creating recipe images is a key challenge in food computing, with applications in culinary education and multimodal recipe assistants. However, existing datasets lack fine-grained alignment among recipe goals, step-wise instructions, and visual content. We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,453 recipes, 196,724 images, and 4,491 videos, covering diverse ingredients, cooking procedures, styles, and dish types. We further propose domain-specific evaluation metrics to assess ingredient fidelity and interaction modeling, benchmark representative T2I, I2V, and T2V models, and provide insights for future recipe generation models. The project page is now available.
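The abstract does not spell out how the ingredient-fidelity metric is computed. As a rough illustration only, the sketch below scores a generated step image against its ingredient list with an off-the-shelf CLIP model; the model choice, prompt template, and mean-similarity aggregation are assumptions for illustration, not the paper's actual metric.

```python
# Illustrative sketch, NOT the RecipeGen metric: approximate a step-level
# "ingredient fidelity" score by comparing a generated step image against
# prompts naming each expected ingredient, using off-the-shelf CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ingredient_fidelity(image_path: str, ingredients: list[str]) -> float:
    """Mean CLIP cosine similarity between one step image and its ingredient prompts."""
    image = Image.open(image_path).convert("RGB")
    prompts = [f"a photo of {ing}" for ing in ingredients]  # prompt template is an assumption
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize embeddings and average the image-to-ingredient similarities.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Hypothetical usage for a step that should show eggs and flour:
# print(ingredient_fidelity("step_03.png", ["eggs", "flour"]))
```

A per-step score like this could then be averaged over a recipe or over the whole benchmark to compare T2I, I2V, and T2V models on ingredient consistency.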