Training-Free Text-to-Image Compositional Food Generation via Prompt Grafting

📅 2026-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of object entanglement—such as the blending of rice and soup due to ambiguous boundaries—in existing text-to-image diffusion models when generating multi-food scenes. To overcome this, the authors propose Prompt Grafting, a training-free framework that operates in two stages: first, spatial regions are established via layout-aware prompts; second, target food prompts are grafted onto these stable layouts to enable controllable generation of multi-food compositions. Notably, this approach allows flexible control over the separation or mixing of food items through prompt editing alone, supporting user-defined spatial arrangements. Evaluated on two food datasets, the method significantly improves the presence accuracy of target foods and demonstrates promising applications in dietary assessment and recipe visualization.

📝 Abstract
Real-world meal images often contain multiple food items, making reliable compositional food image generation important for applications such as image-based dietary assessment, where multi-food data augmentation is needed, and recipe visualization. However, modern text-to-image diffusion models struggle to generate accurate multi-food images due to object entanglement, where adjacent foods (e.g., rice and soup) fuse together because many foods do not have clear boundaries. To address this challenge, we introduce Prompt Grafting (PG), a training-free framework that combines explicit spatial cues in text with implicit layout guidance during sampling. PG runs a two-stage process where a layout prompt first establishes distinct regions and the target prompt is grafted once layout formation stabilizes. The framework enables food entanglement control: users can specify which food items should remain separated or be intentionally mixed by editing the arrangement of layouts. Across two food datasets, our method significantly improves the presence of target objects and provides qualitative evidence of controllable separation.
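The two-stage process described in the abstract — condition early denoising steps on a layout prompt, then graft the target food prompt once layout formation stabilizes — can be sketched as a simple per-step prompt schedule. This is a minimal illustration, not the paper's implementation: the function name, prompts, and the fixed `graft_ratio` switch point are all assumptions for exposition.

```python
def select_prompt(step, total_steps, layout_prompt, target_prompt,
                  graft_ratio=0.3):
    """Pick the conditioning prompt for one denoising step.

    Early steps use the layout prompt to establish distinct spatial
    regions; after a fraction `graft_ratio` of the steps have run,
    the target food prompt is grafted in. `graft_ratio` is a
    hypothetical hyperparameter, not a value from the paper.
    """
    if step < int(graft_ratio * total_steps):
        return layout_prompt
    return target_prompt


# Example schedule over 10 denoising steps: the first 3 steps
# condition on the layout prompt, the remaining 7 on the target.
schedule = [
    select_prompt(s, 10,
                  "two distinct regions on a plate",
                  "rice on the left, soup on the right")
    for s in range(10)
]
```

In a real diffusion sampler this switch would change the text embedding fed to the denoiser at each step; editing the layout prompt (e.g. merging two regions into one) is what would let users control whether foods stay separated or mix.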
Problem

Research questions and friction points this paper is trying to address.

compositional food generation
object entanglement
text-to-image diffusion models
multi-food image generation
food boundary ambiguity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt Grafting
training-free
compositional food generation
object entanglement
layout guidance
Xinyue Pan
Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA
Yuhao Chen
University of Waterloo
Computer Vision, Food Computing, Robotic Vision, Precision Nutrition, Image-Based Plant Phenotyping
Fengqing Zhu
Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA