Application of machine learning to predict food processing level using Open Food Facts

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the automatic classification of food products under the NOVA processing taxonomy and examines how processing level relates to health, environmental, and allergen indicators. Using the Open Food Facts dataset of over 900,000 products, described as the largest dataset of NOVA-labeled products, we derive nutrient concentration-based features and train ensemble models (LightGBM, Random Forest, and CatBoost) for four-class NOVA classification. LightGBM performs best, reaching 80-85% accuracy across different nutrient panels and separating minimally processed from ultra-processed foods effectively. Exploratory analysis associates higher NOVA classes with lower Nutri-Scores, higher carbon footprints, lower Eco-Scores, more additives, and frequent gluten and milk allergens. We also provide a web tool that predicts NOVA class from nutrient data. Together, the results offer a scalable classification method and empirical evidence to support research on ultra-processed foods.

📝 Abstract
Ultra-processed foods are increasingly linked to health issues such as obesity, cardiovascular disease, type 2 diabetes, and mental health disorders, owing to their poor nutritional quality. This study, the first of its kind at this scale, uses machine learning to classify food processing levels (NOVA) based on the Open Food Facts dataset of over 900,000 products. Models including LightGBM, Random Forest, and CatBoost were trained on nutrient concentration data. LightGBM performed best, achieving 80-85% accuracy across different nutrient panels and effectively distinguishing minimally processed from ultra-processed foods. Exploratory analysis revealed strong associations between higher NOVA classes and lower Nutri-Scores, indicating poorer nutritional quality. Products in NOVA 3 and 4 also had higher carbon footprints and lower Eco-Scores, suggesting greater environmental impact. Allergen analysis identified gluten and milk as common in ultra-processed items, posing risks to sensitive individuals. Categories such as Cakes and Snacks were dominant in higher NOVA classes, which also contained more additives, highlighting the role of ingredient modification. This study, leveraging the largest dataset of NOVA-labeled products, emphasizes the health, environmental, and allergenic implications of food processing and showcases machine learning's value in scalable classification. A user-friendly web tool is available for NOVA prediction using nutrient data: https://cosylab.iiitd.edu.in/foodlabel/.
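
To make "trained on nutrient concentration data" concrete, the following is a minimal sketch of how one product record might be turned into a feature vector. It is not the authors' pipeline: the nutrient panel, the log scaling, the `EPSILON` floor, and the helper name `nutrient_features` are all assumptions made for illustration. The keys follow the per-100 g naming used in Open Food Facts exports.

```python
import math

# Hypothetical nutrient panel (assumption): the paper evaluates several
# panels, and this particular selection is illustrative only.
NUTRIENT_PANEL = [
    "energy-kcal_100g",
    "fat_100g",
    "saturated-fat_100g",
    "carbohydrates_100g",
    "sugars_100g",
    "fiber_100g",
    "proteins_100g",
    "salt_100g",
]

# Small floor so the logarithm stays defined for zero values (assumption).
EPSILON = 1e-6


def nutrient_features(product: dict) -> list[float]:
    """Map a product record to log-scaled nutrient values, in panel order."""
    features = []
    for name in NUTRIENT_PANEL:
        value = product.get(name)
        # Missing or negative entries fall back to zero here; a real
        # pipeline might drop or impute such rows instead.
        if value is None or value < 0:
            value = 0.0
        features.append(math.log(value + EPSILON))
    return features


# Toy record, not taken from the dataset.
example = {"energy-kcal_100g": 520.0, "fat_100g": 30.0, "sugars_100g": 45.0}
vector = nutrient_features(example)
```

A vector like this could then be passed to any of the ensemble classifiers named above, with the product's NOVA group as the target label.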
Problem

Research questions and friction points this paper is trying to address.

Predict food processing levels using machine learning
Analyze health and environmental impacts of processed foods
Classify products into NOVA categories accurately from nutrient data
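
The second question above, relating processing level to nutritional quality, can be explored with a simple cross-tabulation. The sketch below computes the share of each Nutri-Score grade within each NOVA group. The records are toy values rather than results from the paper, and the field names `nova_group` and `nutriscore_grade` as well as the helper `grade_shares_by_nova` are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Toy records standing in for dataset rows; values are illustrative only.
products = [
    {"nova_group": 1, "nutriscore_grade": "a"},
    {"nova_group": 1, "nutriscore_grade": "b"},
    {"nova_group": 3, "nutriscore_grade": "c"},
    {"nova_group": 4, "nutriscore_grade": "d"},
    {"nova_group": 4, "nutriscore_grade": "e"},
    {"nova_group": 4, "nutriscore_grade": "d"},
]


def grade_shares_by_nova(rows):
    """Return, for each NOVA group, the fraction of products per grade."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["nova_group"]][row["nutriscore_grade"]] += 1
    shares = {}
    for nova, grade_counts in counts.items():
        total = sum(grade_counts.values())
        shares[nova] = {g: n / total for g, n in grade_counts.items()}
    return shares


shares = grade_shares_by_nova(products)
for nova in sorted(shares):
    print(nova, shares[nova])
```

On real data, a shift of the grade distribution toward "d" and "e" in NOVA 3 and 4 would correspond to the association described in the abstract.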
Innovation

Methods, ideas, or system contributions that make the work stand out.

Machine learning classifies food processing levels using nutrient data
LightGBM model achieves 80-85% accuracy in distinguishing processing categories
Web tool enables NOVA prediction from nutrient data, using a model trained on the Open Food Facts dataset
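
A single accuracy figure such as 80-85% can hide differences between NOVA classes, so it is often read alongside a confusion matrix. The following sketch shows one way to compute both for four-class predictions. The labels are toy values, not results from the paper, and the helper names are hypothetical.

```python
NOVA_CLASSES = (1, 2, 3, 4)


def confusion_matrix(y_true, y_pred):
    """Rows are true NOVA classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(NOVA_CLASSES)}
    matrix = [[0] * len(NOVA_CLASSES) for _ in NOVA_CLASSES]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    recall = {}
    for i, label in enumerate(NOVA_CLASSES):
        row_total = sum(matrix[i])
        recall[label] = matrix[i][i] / row_total if row_total else 0.0
    return recall


# Toy labels for illustration only.
y_true = [1, 1, 2, 3, 3, 4, 4, 4, 4, 4]
y_pred = [1, 1, 2, 3, 4, 4, 4, 4, 3, 4]

matrix = confusion_matrix(y_true, y_pred)
overall = accuracy(matrix)
recall = per_class_recall(matrix)
```

In this toy case the overall accuracy is 0.8 while recall for NOVA 3 is only 0.5, which illustrates why per-class figures are worth reporting next to the headline number.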
Nalin Arora
Department of Computational Biology, Infosys Center for Artificial Intelligence, Center of Excellence in Healthcare, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), New Delhi 110020 India
Aviral Chauhan
Department of Computer Science, Infosys Center for Artificial Intelligence, Center of Excellence in Healthcare, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), New Delhi 110020 India
Siddhant Rana
Department of Computer Science, Infosys Center for Artificial Intelligence, Center of Excellence in Healthcare, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), New Delhi 110020 India
Mahansh Aditya
Human Centered Design, Infosys Center for Artificial Intelligence, Center of Excellence in Healthcare, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), New Delhi 110020 India
Sumit Bhagat
Department of Computer Science, Infosys Center for Artificial Intelligence, Center of Excellence in Healthcare, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), New Delhi 110020 India
Aditya Kumar
Infosys Center for Artificial Intelligence, Center of Excellence in Healthcare, Department of Mathematics, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), New Delhi 110020 India
Akash Kumar
Infosys Center for Artificial Intelligence, Center of Excellence in Healthcare, Department of Mathematics, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), New Delhi 110020 India
Akanksh Semar
Infosys Center for Artificial Intelligence, Center of Excellence in Healthcare, Department of Electronics & Communications Engineering, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), New Delhi 110020 India
Ayush Vikram Singh
Department of Computer Science, Infosys Center for Artificial Intelligence, Center of Excellence in Healthcare, Indraprastha Institute of Information Technology Delhi (IIIT-Delhi), New Delhi 110020 India
Ganesh Bagler
IIIT Delhi
Complex Systems, Computational Biology, Computational Gastronomy, Bioinformatics, Network Science