The Digital Sous Chef -- A Comparative Study on Fine-Tuning Language Models for Recipe Generation

📅 2025-08-20

🤖 AI Summary

Problem: Text-based recipe generation lacks rigorous benchmarks, and generic tokenizers fail to preserve the numerical quantities and structural information that recipes depend on. Method: This paper introduces RecipeDB-5, presented as the first standardized benchmark for recipe generation, and proposes a domain-specific tokenization strategy that expands the vocabulary with 23 fraction tokens and structural markers so that ingredient quantities, step logic, and other key culinary features are encoded explicitly. Contribution/Results: The authors evaluate GPT-2 large (774M) and GPT-2 small (124M) against LSTM/RNN baselines across five cuisine categories, using seven automatic metrics that include BLEU-4, METEOR, ROUGE-L, and BERTScore. Fine-tuned GPT-2 large outperforms the best recurrent baseline: BERTScore F1 rises from 0.72 to 0.92 (a relative improvement of more than 20%), and perplexity drops by 69.8%. These results suggest that structure-aware tokenization combines well with large pretrained language models for recipe generation.

📝 Abstract

We establish a rigorous benchmark for text-based recipe generation, a fundamental task in natural language generation. We present a comparative study contrasting a fine-tuned GPT-2 large (774M) model against the GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine corpus from RecipeDB. Our key contribution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers. This approach addresses a critical limitation of generic tokenizers by preserving essential recipe structures and precise numerical quantities, thereby enhancing domain specificity. Performance is evaluated using a suite of seven automatic metrics spanning fluency (BLEU-4, METEOR), coherence (ROUGE-L), semantic relevance (BERTScore), and diversity. Our experiments show that the large transformer-based approach yields a relative improvement of more than 20% in BERTScore F1 (0.92 vs. 0.72) over the best recurrent baseline, while reducing perplexity by 69.8%. We conclude with a discussion of remaining challenges, particularly regarding factual accuracy, and outline how this foundational study paves the way for integrating real-world constraints and multi-modal inputs in advanced recipe generation research.
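
The headline numbers in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the two BERTScore F1 values and the 69.8% figure quoted above; the function and variable names are illustrative and do not come from the paper.

```python
def relative_change(new: float, baseline: float) -> float:
    """Relative change of `new` with respect to `baseline`, as a fraction."""
    return (new - baseline) / baseline


# BERTScore F1 values reported in the abstract.
bertscore_gpt2_large = 0.92
bertscore_best_rnn = 0.72

absolute_gain = bertscore_gpt2_large - bertscore_best_rnn
relative_gain = relative_change(bertscore_gpt2_large, bertscore_best_rnn)

# A 69.8% perplexity reduction means the fine-tuned model's perplexity is
# (1 - 0.698) times the baseline's, whatever the baseline value is.
perplexity_ratio = 1 - 0.698

print(f"absolute gain: {absolute_gain:.2f}")
print(f"relative gain: {relative_gain:.1%}")
print(f"perplexity ratio: {perplexity_ratio:.3f}")
```

The gap of 0.20 in absolute terms corresponds to roughly 27.8% in relative terms, which is consistent with the "more than 20%" wording. The absolute perplexity values are not given in this summary, so only the ratio is shown.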

Problem

Research questions and friction points this paper is trying to address.

Establishing a benchmark for text-based recipe generation

Addressing tokenization limitations in preserving recipe structures

Enhancing domain specificity with custom vocabulary tokens
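
To make the tokenization problem concrete, here is a minimal standard-library sketch of how a generic splitter breaks fractions and markers apart, and how adding them to the vocabulary keeps them atomic. It is a toy stand-in, not the paper's tokenizer: GPT-2 uses byte-pair encoding, and the five fractions and three marker names below are hypothetical placeholders, since this summary does not list the 23 fraction tokens or the actual structural markers.

```python
import re

# Hypothetical additions: placeholders for the paper's fraction tokens and
# structural markers, which are not enumerated in this summary.
FRACTION_TOKENS = ["1/2", "1/3", "1/4", "3/4", "2/3"]
STRUCTURE_TOKENS = ["<INGREDIENTS>", "<STEPS>", "<END>"]


def generic_tokenize(text: str) -> list[str]:
    """Stand-in for a generic tokenizer: word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def augmented_tokenize(text: str, added: list[str]) -> list[str]:
    """Try the added tokens first (longest first), then fall back to generic splitting."""
    alternatives = "|".join(
        re.escape(tok) for tok in sorted(added, key=len, reverse=True)
    )
    # The lookarounds keep "1/2" from matching inside a longer number like "1/20".
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)|\w+|[^\w\s]")
    return pattern.findall(text)


recipe = "<INGREDIENTS> 1/2 cup sugar <STEPS> Stir in 3/4 cup milk . <END>"
print(generic_tokenize(recipe))
print(augmented_tokenize(recipe, FRACTION_TOKENS + STRUCTURE_TOKENS))
```

With the generic splitter, "1/2" becomes three separate tokens and each marker is split around its angle brackets; with the augmented vocabulary each survives as one token. In a Hugging Face setup the analogous step would typically be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, though whether the authors did it this way is not stated here.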

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned GPT-2 large model for recipes

Targeted tokenization with fraction tokens

Comprehensive evaluation using seven metrics
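
As an illustration of one of the seven metrics, the sketch below computes a sentence-level ROUGE-L score from the longest common subsequence (LCS) of two token sequences. It makes simplifying assumptions that may differ from the paper's evaluation setup: whitespace tokenization, lowercasing, and a plain F1 that weights precision and recall equally. The example sentences are invented for illustration.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (equal weighting assumed)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "mix flour and sugar then bake for 20 minutes"
candidate = "mix flour and butter then bake 20 minutes"
print(f"ROUGE-L F1: {rouge_l_f1(candidate, reference):.3f}")
```

Here the LCS has 7 tokens against a candidate of 8 and a reference of 9, giving an F1 of 14/17, or about 0.824. Because LCS rewards in-order overlap, swapping "sugar" for "butter" costs only one token, which is one reason overlap metrics say little about the factual accuracy the abstract flags as an open challenge.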
