Advancing Food Nutrition Estimation via Visual-Ingredient Feature Fusion

📅 2025-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address low accuracy in food nutrition estimation when nutrition labels are unavailable, this paper introduces FastFood, a large-scale fast-food image dataset with ingredient and nutritional annotations (84,446 images across 908 categories), and proposes a model-agnostic Visual-Ingredient Feature Fusion (VIF²) method. VIF² has three components: (1) an ingredient-aware module that fuses ingredient features with visual features; (2) a robust ingredient training strategy based on synonym replacement and resampling; and (3) test-time refinement of ingredient predictions using large multimodal models with data augmentation and majority voting. Evaluated across diverse backbones, including ResNet, InceptionV3, and ViT, as well as multimodal models such as LLaVA, VIF² improves over state-of-the-art methods on both FastFood and Nutrition5k, reducing mean absolute error (MAE) in nutrient value estimation by 18.7% on average. These results indicate that explicit ingredient information provides measurable gains for nutrition estimation.
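Since the summary reports its gains as a reduction in MAE, a minimal sketch of how that metric and a relative reduction are typically computed may help. The calorie values below are made up for illustration and are not results from the paper; the function names are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between paired predictions and targets."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_reduction(baseline_mae, new_mae):
    """Fractional MAE reduction of a new model relative to a baseline."""
    return (baseline_mae - new_mae) / baseline_mae


# Hypothetical calorie values (kcal), invented for this example only.
actual = [250.0, 540.0, 310.0]
baseline_pred = [300.0, 500.0, 350.0]   # stand-in for a visual-only model
fused_pred = [270.0, 520.0, 330.0]      # stand-in for a fused model

baseline_mae = mean_absolute_error(baseline_pred, actual)
fused_mae = mean_absolute_error(fused_pred, actual)
reduction = relative_reduction(baseline_mae, fused_mae)
print(f"baseline MAE={baseline_mae:.2f}, fused MAE={fused_mae:.2f}, "
      f"reduction={reduction:.1%}")
```

An average reduction across nutrients would presumably be computed per nutrient (calories, fat, protein, and so on) and then averaged, but the summary does not state the exact protocol.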

📝 Abstract

Nutrition estimation is an important component of promoting healthy eating and mitigating diet-related health risks. Despite advances in tasks such as food classification and ingredient recognition, progress in nutrition estimation is limited by the lack of datasets with nutritional annotations. To address this issue, we introduce FastFood, a dataset with 84,446 images across 908 fast food categories, featuring ingredient and nutritional annotations. In addition, we propose a new model-agnostic Visual-Ingredient Feature Fusion (VIF²) method to enhance nutrition estimation by integrating visual and ingredient features. Ingredient robustness is improved through synonym replacement and resampling strategies during training. The ingredient-aware visual feature fusion module combines ingredient features and visual representations to achieve accurate nutritional prediction. During testing, ingredient predictions are refined using large multimodal models with data augmentation and majority voting. Our experiments on both the FastFood and Nutrition5k datasets validate the effectiveness of the proposed method built on different backbones (e.g., ResNet, InceptionV3, and ViT), which demonstrates the importance of ingredient information in nutrition estimation. Project page: https://huiyanqi.github.io/fastfood-nutrition-estimation/
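The test-time refinement step mentioned in the abstract (data augmentation followed by majority voting) can be sketched as follows. This is only an illustration under assumptions: the abstract does not specify the voting rule, so the sketch keeps an ingredient if it appears in more than half of the augmented views, and the per-view predictions are hard-coded rather than produced by a real multimodal model.

```python
from collections import Counter


def majority_vote(predictions_per_view, min_share=0.5):
    """Keep ingredients predicted in more than `min_share` of the views."""
    if not predictions_per_view:
        return []
    counts = Counter()
    for view in predictions_per_view:
        counts.update(set(view))  # count each ingredient once per view
    threshold = min_share * len(predictions_per_view)
    return sorted(name for name, n in counts.items() if n > threshold)


# Hypothetical ingredient predictions for three augmented copies of one image.
views = [
    ["beef patty", "cheese", "bun"],
    ["beef patty", "bun", "lettuce"],
    ["beef patty", "cheese", "bun", "tomato"],
]
refined = majority_vote(views)
print(refined)
```

Ingredients that show up in only one view ("lettuce", "tomato") are dropped, which is the intended effect of voting: predictions that are unstable under augmentation are treated as noise.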

Problem

Research questions and friction points this paper is trying to address.

Lack of datasets with nutritional annotations limits nutrition estimation progress

How to fuse visual and ingredient features to improve nutrition estimation accuracy

How to make ingredient features robust to noisy or inconsistent ingredient labels, addressed here with synonym replacement and resampling
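A minimal sketch of what synonym replacement and resampling could look like as a training-time augmentation of an ingredient list. The synonym table, the probabilities, and the interpretation of "resampling" as drawing a random subset of the ingredients are all assumptions made for illustration; the page does not describe the paper's exact procedure.

```python
import random

# Hypothetical synonym table; the dataset's real vocabulary is not given here.
SYNONYMS = {
    "fries": ["french fries", "chips"],
    "soda": ["soft drink", "cola"],
}


def augment_ingredients(ingredients, rng, replace_prob=0.5, keep_ratio=0.8):
    """Randomly swap ingredients for synonyms, then resample a subset."""
    if not ingredients:
        return []
    swapped = [
        rng.choice(SYNONYMS[name])
        if name in SYNONYMS and rng.random() < replace_prob
        else name
        for name in ingredients
    ]
    # Assumed form of resampling: keep a random subset, in random order.
    keep = max(1, round(keep_ratio * len(swapped)))
    return rng.sample(swapped, keep)


rng = random.Random(0)  # seeded so runs are repeatable
original = ["fries", "soda", "beef patty", "bun", "cheese"]
augmented = augment_ingredients(original, rng)
print(augmented)
```

Applying a fresh augmentation each epoch exposes the model to varied wordings and partial ingredient lists, which is one plausible way to reduce its sensitivity to imperfect ingredient predictions at test time.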

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual-Ingredient Feature Fusion (VIF²) method

Synonym replacement and resampling strategies

Ingredient-aware visual feature fusion module
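To make the fusion idea concrete, here is a deliberately simple sketch: ingredients are encoded as a multi-hot vector, concatenated with a visual feature vector, and passed through a single linear layer that outputs one nutrient value. Concatenation is an assumed stand-in, not the paper's module, whose architecture is not described on this page; the vocabulary, features, and weights are invented and untrained.

```python
def multi_hot(ingredients, vocabulary):
    """Encode an ingredient list as a 0/1 vector over a fixed vocabulary."""
    present = set(ingredients)
    return [1.0 if name in present else 0.0 for name in vocabulary]


def fuse_and_predict(visual_features, ingredient_vector, weights, bias):
    """Concatenate both feature vectors and apply one linear layer."""
    fused = list(visual_features) + list(ingredient_vector)
    if len(fused) != len(weights):
        raise ValueError("weight vector must match fused feature length")
    return sum(w * x for w, x in zip(weights, fused)) + bias


vocabulary = ["bun", "beef patty", "cheese"]       # hypothetical vocabulary
visual = [0.2, 0.5]                                # stand-in for backbone output
ingredients = multi_hot(["bun", "cheese"], vocabulary)
weights = [100.0, 200.0, 150.0, 250.0, 100.0]      # made-up, untrained values
calories = fuse_and_predict(visual, ingredients, weights, bias=10.0)
print(ingredients, calories)
```

Because the fusion only needs a feature vector from the image side, the same head can sit on top of any backbone, which is consistent with the method being described as model-agnostic.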
