FoodCHA: Multi-Modal LLM Agent for Fine-Grained Food Analysis

📅 2026-05-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses inaccurate dietary monitoring in real-world scenarios, where food images exhibit high intra-class similarity, frequent co-occurrence of multiple foods, and diverse cooking styles. The authors propose a hierarchical reasoning framework based on a multimodal agent that follows a cascaded decision path, from coarse category to fine-grained subcategory and then to cooking style. The approach combines a hierarchical anchoring mechanism with a lightweight vision-language model (Moondream-2B) to produce structured predictions of fine-grained food attributes. Evaluated on the FoodNExTDB dataset, the method outperforms Food-Llama-3.2-11B by 13.8% and 38.2% in category and subcategory recognition precision, respectively, and improves cooking style classification precision by 153.2%. The authors link these gains to improved semantic consistency, and the choice of a compact model to practical deployability.
📝 Abstract
The widespread adoption of camera-equipped mobile devices and wearables has enabled convenient capture of meal images, making food recognition a key component of real-time dietary monitoring. However, real-world food images present challenges due to high intra-class similarity and the frequent presence of multiple food items within a single image. While deep learning models achieve strong performance in coarse-grained classification, they often struggle to capture fine-grained attributes such as cooking style. Moreover, open-ended generation in modern vision-language models can produce non-canonical labels, limiting their practical deployment. We propose FoodCHA, a multimodal agentic framework that reformulates food recognition as a hierarchical decision-making process. By progressively anchoring predictions, FoodCHA guides subcategory identification using high-level categories and guides cooking style recognition using subcategories, improving semantic consistency and attribute-level discrimination. To ensure practical deployability, FoodCHA utilizes the compact Moondream-2B vision-language model, which provides strong reasoning capability while maintaining lower computational and memory overhead. Experiments on FoodNExTDB show that FoodCHA outperforms Food-Llama-3.2-11B by 13.8% and 38.2% in category and subcategory recognition precision, respectively, and achieves a striking 153.2% improvement in cooking style classification precision.
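The progressive anchoring described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the taxonomy below is a toy example rather than the FoodNExTDB label set, `ask` is a hypothetical stand-in for a vision-language model query, and the prompts and fallback rule are assumptions made only for illustration. The point is that each stage restricts the next stage's options to the children of the label already chosen.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy taxonomy (NOT the FoodNExTDB labels):
# category -> subcategory -> allowed cooking styles.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "protein": {
        "poultry": ["grilled", "fried", "stewed"],
        "fish": ["grilled", "fried", "baked"],
    },
    "vegetables": {
        "leafy greens": ["raw", "boiled"],
        "potatoes": ["fried", "boiled", "baked"],
    },
}

# Hypothetical model interface: ask(image, question, options) -> answer text.
AskFn = Callable[[object, str, List[str]], str]


def pick(ask: AskFn, image: object, question: str, options: List[str]) -> str:
    """Query the model and accept the answer only if it is an allowed option."""
    answer = ask(image, question, options).strip().lower()
    # Assumption: fall back to the first option when the answer is off-list.
    return answer if answer in options else options[0]


def classify(ask: AskFn, image: object) -> Tuple[str, str, str]:
    """Cascade: category, then subcategory anchored on it, then cooking style."""
    category = pick(ask, image, "Which food category is shown?", sorted(TAXONOMY))
    subcats = TAXONOMY[category]
    subcategory = pick(
        ask, image,
        f"The food belongs to '{category}'. Which subcategory is it?",
        sorted(subcats),
    )
    style = pick(
        ask, image,
        f"The food is '{subcategory}' ({category}). How was it cooked?",
        subcats[subcategory],
    )
    return category, subcategory, style


def fake_ask(image: object, question: str, options: List[str]) -> str:
    """Stub model for demonstration: prefers a fixed path through the taxonomy."""
    for preferred in ("protein", "fish", "baked"):
        if preferred in options:
            return preferred
    return "something off-list"


print(classify(fake_ask, None))  # ('protein', 'fish', 'baked')
```

Because every answer is checked against a fixed option list, the output can never contain a cooking style that is invalid for the chosen subcategory, which is one way to read the "semantic consistency" claim.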
Problem

Research questions and friction points this paper is trying to address.

fine-grained food analysis
food recognition
cooking style classification
multi-food images
non-canonical labels
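
The non-canonical label problem can be made concrete with a small sketch. It assumes a hypothetical vocabulary of cooking styles and uses fuzzy string matching from Python's standard `difflib` module to map free-form model output onto that vocabulary. This is a generic post-processing technique, not the mechanism the paper describes.

```python
import difflib
from typing import List, Optional

# Hypothetical canonical vocabulary, for illustration only.
CANONICAL_STYLES = ["baked", "boiled", "fried", "grilled", "raw", "stewed"]


def canonicalize(raw: str, vocabulary: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Map free-form text to a canonical label, or None if nothing is close."""
    text = raw.strip().lower()
    if text in vocabulary:
        return text
    # Accept a canonical label that appears as a whole word in the text.
    words = text.split()
    for label in vocabulary:
        if label in words:
            return label
    # Otherwise fall back to the closest label by similarity ratio.
    matches = difflib.get_close_matches(text, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(canonicalize("Grilled ", CANONICAL_STYLES))              # grilled
print(canonicalize("it looks fried to me", CANONICAL_STYLES))  # fried
print(canonicalize("griled", CANONICAL_STYLES))                # grilled
```

Returning `None` instead of guessing keeps unmatched outputs visible, so they can be counted as errors rather than silently assigned to a label.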
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal agentic framework
hierarchical decision-making
fine-grained food analysis
cooking style recognition
compact vision-language model