LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets

📅 2025-11-19
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Food image recognition faces three major challenges: domain shift, long-tailed class distribution, and fine-grained category ambiguity. To address them jointly, this paper proposes an LLM-based multimodal enhancement framework, the first to deeply integrate large language models into an end-to-end food recognition pipeline. Methodologically, an LLM parses each input food image to generate descriptive text, including the dish name and key ingredients. Vision-language contrastive learning then maps images from multiple source domains, together with their corresponding LLM-generated texts, into a unified cross-modal embedding space, enabling domain-invariant feature alignment and semantics-guided discriminative representation learning. Extensive experiments on two benchmark food datasets demonstrate that the method consistently outperforms state-of-the-art approaches specialized for any single subtask (domain adaptation, long-tailed learning, or fine-grained classification), with significant improvements in overall recognition accuracy.
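
The summary does not spell out the contrastive objective, but a minimal sketch of how such image-text alignment typically looks is below, assuming a symmetric CLIP-style InfoNCE loss; the function name and temperature value are illustrative, not taken from the paper.

```python
# Hypothetical sketch of the vision-language alignment step: a symmetric
# InfoNCE (CLIP-style) loss over paired image and LLM-generated-text
# embeddings. Names and the temperature are assumptions, not the paper's.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings of matched pairs."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matched pairs together in both directions, push mismatches apart.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```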

📝 Abstract
Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in free-living environments. In addition to this domain-shift problem, real-world food datasets tend to follow a long-tailed distribution, and dishes from different categories can exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images and generate food titles and ingredients. Then, we project the generated texts and the food images from different domains into a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.
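
The first step of that pipeline, having an LLM parse a food image into a title and ingredient list, could be sketched as follows. This is a hypothetical illustration: `query_vlm`, the prompt wording, and the JSON output format are all assumptions, since the abstract does not specify the model or prompt used.

```python
# Minimal sketch of the LLM parsing step. `query_vlm` is a placeholder
# for whatever vision-language model the authors actually call; swap in
# a real client. Prompt and output schema are illustrative only.
import json

PROMPT = (
    "Look at this food photo. Reply in JSON with two fields: "
    '"title" (the dish name) and "ingredients" (a list of key ingredients).'
)

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a vision-language model call returning a JSON string."""
    raise NotImplementedError("plug in a real VLM client here")

def describe_food_image(image_path: str) -> dict:
    """Return {'title': ..., 'ingredients': [...], 'caption': ...} for one image."""
    desc = json.loads(query_vlm(image_path, PROMPT))
    # Flatten to a single caption string for the downstream text encoder.
    desc["caption"] = desc["title"] + ", " + ", ".join(desc["ingredients"])
    return desc
```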
Problem

Research questions and friction points this paper is trying to address.

Addressing domain shift between internet and real-world food images
Solving long-tailed data distribution in food recognition datasets
Distinguishing visually similar food categories with subtle variations
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs generate food titles and ingredients
Project texts and images to shared embedding space
Align multimodal features for food recognition (see the sketch after this list)
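
A minimal sketch of how these three pieces could fit together at recognition time is below, assuming CLIP-style encoders already aligned by a contrastive loss and simple feature concatenation; the component names, dimensions, and fusion choice are assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch tying the steps together: encode the image and its
# LLM-generated caption with aligned encoders, then classify from the
# fused features. Components and sizes are illustrative.
import torch
import torch.nn as nn

class FoodRecognizer(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 embed_dim: int, num_classes: int):
        super().__init__()
        self.image_encoder = image_encoder  # maps images   -> (batch, embed_dim)
        self.text_encoder = text_encoder    # maps captions -> (batch, embed_dim)
        # Classify from the concatenation of the two aligned modalities.
        self.classifier = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, images: torch.Tensor, text_tokens: torch.Tensor):
        img_feat = self.image_encoder(images)
        txt_feat = self.text_encoder(text_tokens)
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return self.classifier(fused)
```

In training, such a head would plausibly be optimized jointly with the contrastive alignment loss sketched under the AI summary; at test time, the caption would come from the same LLM parsing step applied to the query image.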
Qing Wang
School of Computing and Information Systems, Singapore Management University, Singapore
Chong-Wah Ngo
Singapore Management University
Multimedia · Food Computing · Computer Vision · Information Retrieval
Ee-Peng Lim
Singapore Management University
Data and Text Mining · Social Network Mining · Information Integration · Digital Libraries
Qianru Sun
School of Computing and Information Systems, Singapore Management University, Singapore