🤖 AI Summary
Automated nutritional analysis is currently hindered by inconsistent evaluation criteria and the absence of real-world benchmark datasets. To address this, we introduce the January Food Benchmark (JFB), the first application-oriented, high-quality multimodal food benchmark, comprising 1,000 real-world food images and accompanied by a standardized evaluation framework and a holistic scoring mechanism, filling a critical gap in the field's assessment infrastructure. We propose a robustness-aware evaluation metric and a hierarchical testing protocol for vision-language models (VLMs), and develop a dedicated model, january/food-vision-v1. On JFB, our model achieves an overall score of 86.2, outperforming the best general-purpose model by 12.1 points. These results demonstrate the effectiveness of our benchmark, evaluation framework, and model design.
📝 Abstract
Progress in AI for automated nutritional analysis is critically hampered by the lack of standardized evaluation methodologies and high-quality, real-world benchmark datasets. To address this, we introduce three primary contributions. First, we present the January Food Benchmark (JFB), a publicly available collection of 1,000 food images with human-validated annotations. Second, we detail a comprehensive benchmarking framework, including robust metrics and a novel, application-oriented overall score designed to assess model performance holistically. Third, we provide baseline results from both general-purpose Vision-Language Models (VLMs) and our own specialized model, january/food-vision-v1. Our evaluation demonstrates that the specialized model achieves an Overall Score of 86.2, a 12.1-point improvement over the best-performing general-purpose configuration. This work offers the research community a valuable new evaluation dataset and a rigorous framework to guide and benchmark future developments in automated nutritional analysis.
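To make the idea of an application-oriented Overall Score concrete, the sketch below shows one plausible way to aggregate per-nutrient accuracy into a single 0-100 score across a benchmark. The field names, the error-to-score mapping, and the `NUTRIENT_WEIGHTS` are illustrative assumptions for this sketch only; they are not the metric actually defined by the JFB framework.

```python
# Hypothetical sketch of an application-oriented "overall score" for nutrition
# estimation. Field names, error-to-score mapping, and weights are assumptions,
# not the scoring mechanism defined by the JFB paper.
from statistics import mean

NUTRIENT_WEIGHTS = {"calories": 0.4, "protein_g": 0.2, "carbs_g": 0.2, "fat_g": 0.2}

def field_score(pred: float, truth: float) -> float:
    """Map absolute percentage error to a 0-100 score (0% error -> 100)."""
    if truth == 0:
        return 100.0 if pred == 0 else 0.0
    ape = abs(pred - truth) / abs(truth)
    return max(0.0, 100.0 * (1.0 - ape))

def overall_score(predictions: list[dict], annotations: list[dict]) -> float:
    """Weighted per-nutrient score per image, averaged over the benchmark."""
    per_image = [
        sum(w * field_score(pred[k], truth[k]) for k, w in NUTRIENT_WEIGHTS.items())
        for pred, truth in zip(predictions, annotations)
    ]
    return mean(per_image)

# Toy usage: two images, model predictions vs. human-validated annotations.
preds = [{"calories": 520, "protein_g": 30, "carbs_g": 45, "fat_g": 22},
         {"calories": 310, "protein_g": 12, "carbs_g": 50, "fat_g": 8}]
truth = [{"calories": 500, "protein_g": 28, "carbs_g": 50, "fat_g": 20},
         {"calories": 350, "protein_g": 10, "carbs_g": 55, "fat_g": 9}]
print(f"Overall Score: {overall_score(preds, truth):.1f}")
```

In practice the published framework also covers non-numeric aspects of model output (e.g., food identification and robustness), so any real aggregation would weight more components than the four nutrient fields used in this toy example.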