Food Image Generation on Multi-Noun Categories

📅 2025-12-09
🤖 AI Summary
Multi-noun food categories (e.g., “egg noodles”) are frequently misparsed as independent lexical units in text-to-image generation, leading to semantic distortion and spatial misalignment. To address this, we propose FoCULR (Food Category Understanding and Layout Refinement), the first framework to embed a food knowledge graph into the text encoder. It explicitly models semantic dependencies among multi-noun constituents via a relation-decoupling module and enforces structural fidelity through a novel spatial layout constraint loss. Built upon diffusion models, FoCULR achieves significant improvements on benchmarks including UEC-256: FID and CLIP-Score improve markedly; accuracy for multi-noun food categories increases by 32.7%; and spurious ingredient generation decreases by 58.4%. This work establishes an interpretable, controllable, and domain-adaptive paradigm for fine-grained food image synthesis.

πŸ“ Abstract
Generating realistic food images for categories with multiple nouns is surprisingly challenging. For instance, the prompt "egg noodle" may result in images that incorrectly contain both eggs and noodles as separate entities. Multi-noun food categories are common in real-world datasets and account for a large portion of entries in benchmarks such as UEC-256. These compound names often cause generative models to misinterpret the semantics, producing unintended ingredients or objects. The cause is twofold: the text encoder lacks sufficient knowledge of multi-noun categories, and it misinterprets the relationships between the nouns, which leads to incorrect spatial layouts. To overcome these challenges, we propose FoCULR (Food Category Understanding and Layout Refinement), which incorporates food domain knowledge and introduces core concepts early in the generation process. Experimental results demonstrate that the integration of these techniques improves image generation performance in the food domain.
Problem

Research questions and friction points this paper is trying to address.

Generating realistic images for multi-noun food categories
Misinterpretation of compound food names by generative models
Incorrect spatial layouts caused by missing food domain knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates food domain knowledge early
Refines spatial layouts for multi-noun categories
Enhances text encoder understanding of relationships
Xinyue Pan
Purdue University
Yuhao Chen
University of Waterloo
Jiangpeng He
Purdue University
Computer Vision, Deep Learning
Fengqing Zhu
Purdue University