V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Methods that estimate nutrition from a single image of the plated dish struggle to identify key ingredients that become visually ambiguous after cooking, such as oils and sauces, which leads to substantial errors in calorie and macronutrient predictions. This work proposes V-Nutri, a dish-level nutrition estimation framework that uses first-person (egocentric) cooking videos, and establishes the first benchmark for video-based nutrition estimation by manually annotating the HD-EPIC dataset. A VideoMamba-based event detection module targets ingredient-addition moments to select informative cooking keyframes, and a lightweight multi-frame fusion module combines features from these keyframes with features from the final dish frame. Experiments on HD-EPIC show that cooking-process cues provide complementary nutritional evidence and improve estimation under controlled conditions. The benefit depends strongly on the representational capacity of the visual backbone and on the quality of event detection.
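To make the keyframe-selection idea concrete, here is a minimal sketch of how per-frame ingredient-addition scores could be turned into a small set of keyframes. It is not the V-Nutri implementation: the function name `select_keyframes`, the score threshold, the minimum frame gap, and the greedy temporal suppression are all illustrative assumptions, and the scores stand in for the output of an event detector such as the VideoMamba-based model.

```python
from typing import List, Sequence


def select_keyframes(
    scores: Sequence[float],
    threshold: float = 0.5,   # assumed cutoff on the event score
    min_gap: int = 30,        # assumed minimum distance (in frames) between keyframes
    max_frames: int = 8,      # assumed cap on the number of keyframes
) -> List[int]:
    """Return indices of high-scoring frames that are at least min_gap apart."""
    # Candidate frames, highest score first (ties broken by earlier index).
    candidates = sorted(
        (i for i, s in enumerate(scores) if s >= threshold),
        key=lambda i: (-scores[i], i),
    )
    selected: List[int] = []
    for i in candidates:
        if len(selected) >= max_frames:
            break
        # Greedy temporal suppression: skip frames too close to one already kept.
        if all(abs(i - j) >= min_gap for j in selected):
            selected.append(i)
    # Report keyframes in temporal order.
    return sorted(selected)


if __name__ == "__main__":
    demo_scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.95]
    print(select_keyframes(demo_scores, min_gap=2))
```

Suppressing neighbours of an already selected frame keeps the keyframes spread across distinct addition events rather than clustered around a single one.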

📝 Abstract

Nutrition estimation of meals from visual data is an important problem for dietary monitoring and computational health, but existing approaches largely rely on a single image of the finished dish. This setting is fundamentally limited because many nutritionally relevant ingredients and transformations, such as oils, sauces, and mixed components, become visually ambiguous after cooking, making accurate calorie and macronutrient estimation difficult. In this paper, we investigate whether cooking-process information from egocentric cooking videos can contribute to dish-level nutrition estimation. First, we extend the HD-EPIC dataset with additional manual annotations and establish the first benchmark for video-based nutrition estimation. Second, we propose V-Nutri, a staged framework that combines Nutrition5K-pretrained visual backbones with a lightweight fusion module that aggregates features from the final dish frame and from cooking-process keyframes extracted from the egocentric videos. V-Nutri also includes a keyframe selection module: a VideoMamba-based event-detection model that targets ingredient-addition moments. Experiments on the HD-EPIC dataset show that process cues can provide complementary nutritional evidence, improving nutrition estimation under controlled conditions. Our results further indicate that the benefit of process keyframes depends strongly on backbone representation capacity and event detection quality. Our code and annotated dataset are available at https://github.com/K624-YCK/V-Nutri.
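The fusion step described in the abstract can be illustrated with a small sketch. The abstract does not specify the fusion operator, so everything here is an assumption made for illustration: keyframe features are mean-pooled, concatenated with the final-dish feature, and passed through a single linear head. The names `mean_pool`, `fuse`, and `linear_head` are hypothetical, and the feature vectors stand in for outputs of a pretrained visual backbone.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(features: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def fuse(final_feat: Sequence[float],
         keyframe_feats: Sequence[Sequence[float]]) -> Vector:
    """Concatenate the final-dish feature with the pooled keyframe feature.

    Assumption: if no keyframes were selected, a zero vector is used so the
    fused vector keeps a fixed length.
    """
    if keyframe_feats:
        pooled = mean_pool(keyframe_feats)
    else:
        pooled = [0.0] * len(final_feat)
    return list(final_feat) + pooled


def linear_head(x: Sequence[float],
                weights: Sequence[Sequence[float]],
                bias: Sequence[float]) -> Vector:
    """One linear layer mapping the fused vector to nutrition targets."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


if __name__ == "__main__":
    final_feat = [1.0, 2.0]                  # toy final-dish feature
    keyframes = [[0.0, 4.0], [2.0, 0.0]]     # toy keyframe features
    fused = fuse(final_feat, keyframes)
    # Two toy outputs, e.g. standing in for calories and one macronutrient.
    weights = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
    print(linear_head(fused, weights, bias=[0.0, 1.0]))
```

Keeping the final-dish feature as a separate slot in the fused vector means that removing the keyframe half reduces the model to a single-image baseline, which makes it easy to measure what the process cues contribute.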

Problem

Research questions and friction points this paper is trying to address.

nutrition estimation

egocentric cooking videos

dish-level

visual ambiguity

dietary monitoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric video

nutrition estimation

cooking process modeling