LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets

📅 2025-11-19
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Food image recognition faces three major challenges: domain shift, long-tailed class distribution, and fine-grained category ambiguity. To address them jointly, this paper proposes an LLM-based multimodal enhancement framework, the first to deeply integrate large language models into an end-to-end food recognition pipeline. Methodologically, an LLM parses each input food image to generate descriptive text, including the dish name and key ingredients. Vision-language contrastive learning then maps images from multiple source domains, together with their corresponding LLM-generated texts, into a unified cross-modal embedding space, enabling domain-invariant feature alignment and semantics-guided discriminative representation learning. Extensive experiments on two benchmark food datasets demonstrate that the method consistently outperforms state-of-the-art approaches specialized for any single subtask (domain adaptation, long-tailed learning, or fine-grained classification), with significant improvements in overall recognition accuracy.
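
The summary does not spell out the contrastive objective, but a minimal sketch of how such image-text alignment typically looks is below, assuming a symmetric CLIP-style InfoNCE loss; the function name and temperature value are illustrative, not taken from the paper.

```python
# Hypothetical sketch of the vision-language alignment step: a symmetric
# InfoNCE (CLIP-style) loss over paired image and LLM-generated-text
# embeddings. Names and the temperature are assumptions, not the paper's.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings of matched pairs."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matched pairs together in both directions, push mismatches apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```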

📝 Abstract
Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in free-living environments. In addition to this domain-shift problem, real-world food datasets tend to follow a long-tailed distribution, and dishes from different categories can exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images and generate food titles and ingredients. Then, we project the generated texts and the food images from different domains into a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.
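
The first step of that pipeline, having an LLM parse a food image into a title and ingredient list, could be sketched as follows. This is a hypothetical illustration: `query_vlm`, the prompt wording, and the JSON output format are all assumptions, since the abstract does not specify the model or prompt used.

```python
# Minimal sketch of the LLM parsing step. `query_vlm` is a placeholder
# for whatever vision-language model the authors actually call; swap in
# a real client. Prompt and output schema are illustrative only.
import json

PROMPT = (
    "Look at this food photo. Reply in JSON with two fields: "
    '"title" (the dish name) and "ingredients" (a list of key ingredients).'
)

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a vision-language model call returning a JSON string."""
    raise NotImplementedError("plug in a real VLM client here")

def describe_food_image(image_path: str) -> dict:
    """Return {'title': ..., 'ingredients': [...], 'caption': ...} for one image."""
    desc = json.loads(query_vlm(image_path, PROMPT))
    # Flatten to a single caption string for the downstream text encoder.
    desc["caption"] = desc["title"] + ", " + ", ".join(desc["ingredients"])
    return desc
```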
Problem

Research questions and friction points this paper is trying to address.

Addressing domain shift between internet and real-world food images
Solving long-tailed data distribution in food recognition datasets
Distinguishing visually similar food categories with subtle variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs generate food titles and ingredients
Project texts and images to shared embedding space
Align multimodal features for food recognition (see the sketch after this list)
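
A minimal sketch of how these three pieces could fit together at recognition time is below, assuming CLIP-style encoders already aligned by a contrastive loss and simple feature concatenation; the component names, dimensions, and fusion choice are assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch tying the steps together: encode the image and its
# LLM-generated caption with aligned encoders, then classify from the
# fused features. Components and sizes are illustrative.
import torch
import torch.nn as nn

class FoodRecognizer(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 embed_dim: int, num_classes: int):
        super().__init__()
        self.image_encoder = image_encoder  # maps images   -> (batch, embed_dim)
        self.text_encoder = text_encoder    # maps captions -> (batch, embed_dim)
        # Classify from the concatenation of the two aligned modalities.
        self.classifier = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, images: torch.Tensor, text_tokens: torch.Tensor):
        img_feat = self.image_encoder(images)
        txt_feat = self.text_encoder(text_tokens)
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return self.classifier(fused)
```

In training, such a head would plausibly be optimized jointly with the contrastive alignment loss sketched under the AI summary; at test time, the caption would come from the same LLM parsing step applied to the query image.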
Qing Wang
School of Computing and Information Systems, Singapore Management University, Singapore
Chong-Wah Ngo
Singapore Management University
Multimedia · Food Computing · Computer Vision · Information Retrieval
Ee-Peng Lim
Singapore Management University
Data and Text Mining · Social Network Mining · Information Integration · Digital Libraries
Qianru Sun
School of Computing and Information Systems, Singapore Management University, Singapore