🤖 AI Summary
Existing food volume estimation methods rely on specialized hardware (e.g., 3D scanners), depth sensors, or reference-object calibration—factors that hinder portability, calibration-free operation, and clinical-grade accuracy required in medical nutrition management. To address this, we propose the first reference-free, depth-free 3D food reconstruction and volumetric estimation framework tailored for off-the-shelf AR-enabled smartphones. Our method jointly leverages structure-from-motion (SfM) to reconstruct sparse point clouds, temporally consistent food instance segmentation from video, and geometrically constrained voxelization to enable end-to-end volume computation. To support robust segmentation under real-world complexity, we introduce the first large-scale food video segmentation dataset covering diverse, challenging scenarios. Evaluated across multiple benchmarks, our approach achieves a mean absolute percentage error (MAPE) of 2.22%, substantially outperforming prior art and enabling portable, calibration-free, clinically accurate nutritional assessment.
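To make the final volume-computation step concrete, the sketch below counts occupied voxels in a metrically scaled point cloud and multiplies by the volume of one voxel. It is only an illustration under stated assumptions, not the geometrically constrained voxelization used in the paper: the function name `voxel_volume`, the `voxel_size` parameter, and the assumption that the points fill the food interior (rather than only its surface) are all hypothetical.

```python
import numpy as np


def voxel_volume(points: np.ndarray, voxel_size: float) -> float:
    """Approximate volume as (number of occupied voxels) * (voxel volume).

    Assumptions (not from the paper): `points` is an (N, 3) array in metres
    that already samples the object's interior, and `voxel_size` is the voxel
    edge length in metres.
    """
    if points.ndim != 2 or points.shape[1] != 3:
        raise ValueError("points must have shape (N, 3)")
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    # Map every point to the integer index of the voxel that contains it.
    indices = np.floor(points / voxel_size).astype(np.int64)
    # Each distinct index triple is one occupied voxel.
    occupied = np.unique(indices, axis=0).shape[0]
    return occupied * voxel_size ** 3


# Hypothetical usage: a densely sampled 10 cm cube, voxelized at 1 cm.
axis = 0.0025 + 0.005 * np.arange(20)
cube = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
cube_volume_m3 = voxel_volume(cube, voxel_size=0.01)
```

A surface-only point cloud would leave interior voxels empty and underestimate the volume, so a real pipeline would need an additional filling or meshing step before counting.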
📝 Abstract
Accurate food volume estimation is crucial for medical nutrition management and health monitoring applications, but current methods are often limited by monocular data, by reliance on single-purpose hardware such as 3D scanners, by the need for sensor-derived information such as depth, or by camera calibration with a reference object. In this paper, we present VolE, a novel framework that leverages mobile device-driven 3D reconstruction to estimate food volume. Using AR-capable mobile devices, VolE captures images and camera locations in free motion to generate precise 3D models. To achieve real-world measurement, VolE is reference- and depth-free and leverages food video segmentation to generate food masks. We also introduce a new food dataset encompassing challenging scenarios absent from previous benchmarks. Our experiments demonstrate that VolE outperforms existing volume estimation techniques across multiple datasets, achieving 2.22% MAPE.
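For readers unfamiliar with the reported metric, mean absolute percentage error (MAPE) averages the absolute relative error between estimated and ground-truth volumes and expresses it as a percentage. The following is a minimal sketch of the standard definition; the function name `mape` and the sample volumes are hypothetical and are not taken from the paper's data.

```python
from typing import Sequence


def mape(true_values: Sequence[float], predicted_values: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Standard definition: 100 * mean(|predicted - true| / |true|).
    Ground-truth values must be non-zero.
    """
    if len(true_values) != len(predicted_values):
        raise ValueError("inputs must have the same length")
    if len(true_values) == 0:
        raise ValueError("inputs must not be empty")
    if any(t == 0 for t in true_values):
        raise ValueError("ground-truth values must be non-zero")
    relative_errors = [
        abs(p - t) / abs(t) for t, p in zip(true_values, predicted_values)
    ]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Hypothetical ground-truth and estimated volumes in millilitres.
true_ml = [100.0, 200.0]
estimated_ml = [98.0, 204.0]
error_percent = mape(true_ml, estimated_ml)
```

Because each error is normalized by the true volume, small and large food items contribute equally to the score.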