Training-Free Text-to-Image Compositional Food Generation via Prompt Grafting

📅 2026-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of object entanglement—such as the blending of rice and soup due to ambiguous boundaries—in existing text-to-image diffusion models when generating multi-food scenes. To overcome this, the authors propose Prompt Grafting, a training-free framework that operates in two stages: first, spatial regions are established via layout-aware prompts; second, target food prompts are grafted onto these stable layouts to enable controllable generation of multi-food compositions. Notably, this approach allows flexible control over the separation or mixing of food items through prompt editing alone, supporting user-defined spatial arrangements. Evaluated on two food datasets, the method significantly improves the presence accuracy of target foods and demonstrates promising applications in dietary assessment and recipe visualization.

📝 Abstract
Real-world meal images often contain multiple food items, making reliable compositional food image generation important for applications such as image-based dietary assessment, where multi-food data augmentation is needed, and recipe visualization. However, modern text-to-image diffusion models struggle to generate accurate multi-food images due to object entanglement, where adjacent foods (e.g., rice and soup) fuse together because many foods do not have clear boundaries. To address this challenge, we introduce Prompt Grafting (PG), a training-free framework that combines explicit spatial cues in text with implicit layout guidance during sampling. PG runs a two-stage process where a layout prompt first establishes distinct regions and the target prompt is grafted once layout formation stabilizes. The framework enables food entanglement control: users can specify which food items should remain separated or be intentionally mixed by editing the arrangement of layouts. Across two food datasets, our method significantly improves the presence of target objects and provides qualitative evidence of controllable separation.
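The two-stage process described in the abstract — condition early denoising steps on a layout prompt, then graft the target food prompt once layout formation stabilizes — can be sketched as a simple per-step prompt schedule. This is a minimal illustration, not the paper's implementation: the function name, prompts, and the fixed `graft_ratio` switch point are all assumptions for exposition.

```python
def select_prompt(step, total_steps, layout_prompt, target_prompt,
                  graft_ratio=0.3):
    """Pick the conditioning prompt for one denoising step.

    Early steps use the layout prompt to establish distinct spatial
    regions; after a fraction `graft_ratio` of the steps have run,
    the target food prompt is grafted in. `graft_ratio` is a
    hypothetical hyperparameter, not a value from the paper.
    """
    if step < int(graft_ratio * total_steps):
        return layout_prompt
    return target_prompt


# Example schedule over 10 denoising steps: the first 3 steps
# condition on the layout prompt, the remaining 7 on the target.
schedule = [
    select_prompt(s, 10,
                  "two distinct regions on a plate",
                  "rice on the left, soup on the right")
    for s in range(10)
]
```

In a real diffusion sampler this switch would change the text embedding fed to the denoiser at each step; editing the layout prompt (e.g. merging two regions into one) is what would let users control whether foods stay separated or mix.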
Problem

Research questions and friction points this paper is trying to address.

compositional food generation
object entanglement
text-to-image diffusion models
multi-food image generation
food boundary ambiguity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt Grafting
training-free
compositional food generation
object entanglement
layout guidance
Xinyue Pan
Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA
Yuhao Chen
University of Waterloo
Computer Vision, Food Computing, Robotic Vision, Precision Nutrition, Image-Based Plant Phenotyping
Fengqing Zhu
Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA