Evaluating Large Multimodal Models for Nutrition Analysis: A Benchmark Enriched with Contextual Metadata

📅 2025-07-09
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This study addresses the limited accuracy of large multimodal models (LMMs) in nutrition analysis. We propose a context-aware framework that integrates lightweight contextual metadata (GPS-derived location and venue-type information, time-derived meal and day-type indicators, and food category labels) with multi-strategy reasoning enhancement (chain-of-thought prompting, few-shot learning, and expert-role prompting). We introduce ACETADA, the first open-source, context-annotated dataset for nutritional analysis, and systematically evaluate eight state-of-the-art LMMs. Our experiments reveal a synergistic effect between contextual metadata and reasoning modifiers, significantly reducing mean absolute error (MAE) and mean absolute percentage error (MAPE) in calorie and macronutrient estimation. The core contribution is the empirical validation that lightweight contextual encoding substantially improves the robustness of LMMs for nutritional estimation, while also filling a critical gap in open, multi-model, context-driven evaluation benchmarks for nutritional analysis.
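As a concrete illustration of what "lightweight contextual metadata" can look like in practice, below is a minimal Python sketch that derives meal-type and day-type indicators from a timestamp and serializes them, together with a venue type and a food category label, into a prompt fragment. The function names, thresholds, and prompt wording are assumptions for illustration, not the authors' implementation.

```python
from datetime import datetime

# Hypothetical helpers mirroring the paper's metadata types:
# GPS -> venue type (assumed already resolved here), timestamp ->
# meal/day type, plus a food category label. Thresholds are
# illustrative assumptions, not taken from the paper.

def meal_type(ts: datetime) -> str:
    """Map a timestamp to a coarse meal indicator."""
    hour = ts.hour
    if 5 <= hour < 11:
        return "breakfast"
    if 11 <= hour < 16:
        return "lunch"
    if 16 <= hour < 22:
        return "dinner"
    return "snack"

def day_type(ts: datetime) -> str:
    """Weekday vs. weekend as the day-type indicator."""
    return "weekend" if ts.weekday() >= 5 else "weekday"

def build_context(venue: str, ts: datetime, food_category: str) -> str:
    """Serialize the metadata into a short prompt fragment."""
    return (
        f"Context: venue={venue}; meal={meal_type(ts)}; "
        f"day={day_type(ts)}; food_category={food_category}."
    )

prompt = (
    build_context("fast-food restaurant", datetime(2024, 3, 2, 12, 30), "burger")
    + " Estimate calories, protein, carbohydrates, and fat for the meal in the image."
)
print(prompt)
```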

📝 Abstract
Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT-4, leaving the broader range of LMMs underexplored. Additionally, the influence of integrating contextual metadata, and its interaction with various reasoning modifiers, remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values: calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce ACETADA, a new food-image dataset slated for public release. This open dataset provides nutrition information verified by a dietitian and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open-weight and four closed-weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how incorporating contextual information enhances the efficacy of reasoning modifiers such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona. Empirical results show that intelligently integrated metadata, even when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) of predicted nutritional values. This work highlights the potential of context-aware LMMs for improved nutrition analysis.
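The abstract's headline metrics are standard regression errors. For readers unfamiliar with them, the following self-contained sketch computes MAE and MAPE between predicted and dietitian-verified calorie values; the sample numbers are made up and do not come from the paper.

```python
# MAE and MAPE between predicted and ground-truth nutrient values.
# Variable names are illustrative, not tied to the ACETADA schema.

def mae(pred, truth):
    """Mean absolute error: average magnitude of the prediction error."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

def mape(pred, truth):
    """Mean absolute percentage error; assumes no zero ground-truth values."""
    return 100.0 * sum(abs(p - t) / t for p, t in zip(pred, truth)) / len(truth)

predicted_kcal = [520.0, 310.0, 780.0]   # hypothetical LMM outputs
verified_kcal = [480.0, 350.0, 700.0]    # hypothetical dietitian-verified values
print(f"MAE:  {mae(predicted_kcal, verified_kcal):.1f} kcal")
print(f"MAPE: {mape(predicted_kcal, verified_kcal):.1f} %")
```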
Problem

Research questions and friction points this paper is trying to address.

Evaluating LMMs for nutrition analysis with contextual metadata
Measuring the impact of contextual metadata on the accuracy of nutritional value estimation
Introducing the ACETADA dataset as an open benchmark for analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates GPS, timestamp, and food-item metadata
Introduces the ACETADA open food-image dataset
Enhances LMMs with contextual reasoning modifiers (see the sketch after this list)
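To make the interaction between metadata and reasoning modifiers concrete, here is a hedged sketch of how an expert-persona preamble, a chain-of-thought instruction, and few-shot examples could be composed with a metadata fragment into a single text prompt. The wording, examples, and nutrient values below are illustrative assumptions, not the prompts used in the paper.

```python
# Composing reasoning modifiers (expert persona, chain-of-thought, few-shot)
# with a contextual metadata fragment. All strings are illustrative.

EXPERT_PERSONA = (
    "You are a registered dietitian experienced in estimating portion "
    "sizes and nutrient content from meal photos."
)
CHAIN_OF_THOUGHT = (
    "Reason step by step about the visible foods and portion sizes "
    "before giving your final estimate."
)
FEW_SHOT = [
    # (context, answer) pairs; the numbers are made up for illustration.
    ("Context: venue=cafe; meal=breakfast; day=weekday; food_category=pastry.",
     "Estimate: 420 kcal, 7 g protein, 52 g carbohydrates, 20 g fat."),
]

def compose_prompt(context: str) -> str:
    """Join persona, reasoning instruction, examples, and the query."""
    shots = "\n".join(f"Example:\n{c}\n{a}" for c, a in FEW_SHOT)
    query = context + " Now estimate calories, protein, carbohydrates, and fat."
    return "\n\n".join([EXPERT_PERSONA, CHAIN_OF_THOUGHT, shots, query])

print(compose_prompt(
    "Context: venue=food court; meal=lunch; day=weekend; food_category=noodles."
))
```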
👥 Authors
Bruce Coburn
Purdue University, West Lafayette, IN 47906 USA
Jiangpeng He
Purdue University · Computer Vision · Deep Learning
Megan E. Rollo
Curtin University, Bentley WA 6102, Australia
Satvinder S. Dhaliwal
Curtin University, Bentley WA 6102, Australia
Deborah A. Kerr
Curtin University, Bentley WA 6102, Australia
Fengqing Zhu
Purdue University, West Lafayette, IN 47906 USA