CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation

📅 2025-10-27
🏛️ Proceedings of the 33rd ACM International Conference on Multimedia
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models struggle to generate image sequences aligned with multi-step cooking instructions: they produce fixed-length outputs, lack adaptability to recipe structural variations, and exhibit poor cross-step visual consistency. To address these limitations, we propose three novel mechanisms—Step-wise Regional Control, Flexible RoPE, and Cross-Step Consistency Control—that enable region-level inter-step alignment, dynamic-length positional encoding, and modeling of cross-step consistency in ingredients and style, respectively. Built upon a diffusion framework, our method supports arbitrary instruction lengths without fine-tuning, maintaining high-fidelity generation in both fully trained and zero-shot settings. On a dedicated recipe-image generation benchmark, it significantly outperforms state-of-the-art methods, achieving—for the first time—semantically coherent, structurally adaptive, and visually consistent visualization of multi-step culinary instructions. This advances controllable image generation toward structured instruction understanding.

📝 Abstract
Cooking is a sequential and visually grounded activity, where each step, such as chopping, mixing, or frying, carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods cannot adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything outperforms existing methods in both training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation. More details are at https://github.com/zhangdaxia22/CookAnything.
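The abstract does not give implementation details for Flexible RoPE, but the idea of a step-aware positional encoding over an arbitrary number of steps can be sketched minimally: standard rotary embeddings applied to positions where each step's tokens are offset by a fixed stride, so any recipe length maps onto the same positional grid. The function names (`step_aware_positions`, `apply_rope`) and the stride scheme are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotary angles for the given positions: shape (len(positions), dim // 2)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (n_tokens, dim) by position-dependent angles."""
    _, dim = x.shape
    ang = rope_angles(positions, dim, base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def step_aware_positions(step_lengths, step_stride=100):
    """Hypothetical 'flexible' indexing: tokens of step k start at k * step_stride,
    so recipes with any number of steps share one consistent positional layout."""
    pos = []
    for k, length in enumerate(step_lengths):
        pos.extend(k * step_stride + np.arange(length))
    return np.array(pos, dtype=float)
```

Because RoPE is a pure rotation, it preserves token norms while encoding relative offsets, which is what lets the scheme extend to unseen step counts without retraining.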
Problem

Research questions and friction points this paper is trying to address.

Generating coherent image sequences from cooking instructions of variable length
Maintaining ingredient consistency across multiple recipe steps
Balancing temporal coherence with spatial diversity in recipe visualization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-wise Regional Control aligns text steps with image regions
Flexible RoPE enhances temporal coherence and spatial diversity
Cross-Step Consistency Control maintains ingredient consistency across steps
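Step-wise Regional Control, as described above, aligns each textual step with its own image region inside a single denoising pass. One plausible realization, offered here only as a sketch (the paper may implement this differently), is a block-diagonal cross-attention mask in which the text tokens of step k may attend only to the image patches of region k:

```python
import numpy as np

def regional_attention_mask(n_steps, tokens_per_step, patches_per_region):
    """Block-diagonal cross-attention mask (True = attention allowed):
    text tokens of step k see only the image patches of region k."""
    n_text = n_steps * tokens_per_step
    n_img = n_steps * patches_per_region
    mask = np.zeros((n_text, n_img), dtype=bool)
    for k in range(n_steps):
        t0, t1 = k * tokens_per_step, (k + 1) * tokens_per_step
        p0, p1 = k * patches_per_region, (k + 1) * patches_per_region
        mask[t0:t1, p0:p1] = True
    return mask
```

Since the mask is built from `n_steps` at runtime, the same mechanism covers recipes of any length, which matches the framework's variable-length goal.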