HUGO-CS: A Hybrid-Labeled, Uncertainty-Aware, General-Purpose, Observational Dataset for Cold Spray

📅 2026-05-05

📝 Abstract
Cold spraying is an increasingly common approach for repairing and manufacturing components because of its solid-state manufacturing capabilities. However, process optimization remains difficult because the process has many interdependent parameters and there is no large-scale, machine-readable data to support modeling. While the scientific literature contains many relevant experiments, results are reported inconsistently (often in tables and figures) and in non-uniform units, which limits their use at scale. To address these limitations, this work presents HUGO-CS, a literature-derived dataset of 4,383 cold-spray experiments with 144 features from 1,124 sources, more than 30x the size of the previous largest dataset (137 samples). Because fully manual extraction requires an average of 91 minutes per document, this work designs and applies a Hybrid-labeled, Uncertainty-aware, General-purpose, Observational extraction framework, called HUGO. HUGO combines automated LLM-based labeling with targeted manual label refinement to extract experimental results from the scientific literature. To balance labeling efficiency with extraction accuracy, HUGO introduces Hierarchical Risk Mitigation (HRM), which routes LLM outputs with a high risk of error to manual review while retaining low-risk records as auto-labeled. Lastly, HUGO post-processing consolidates categorical descriptors, maps reported feedstock chemistries into structured continuous compositions, and normalizes units across sources. Of the 4,383 reported experiments, 1,765 are hand-labeled, providing a high-quality subset for benchmarking, error analysis, and higher-fidelity data points. All code needed to replicate this work, along with the complete HUGO-CS dataset, is released under a CC-BY license at https://github.com/sprice134/HUGO.
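The routing idea behind HRM can be illustrated with a minimal sketch. The abstract does not specify the actual risk checks, so everything here is an assumption made for illustration: the feature names, the plausible ranges, the confidence threshold, and the three-level ordering (missing values, then physical plausibility, then LLM confidence) are hypothetical and are not taken from the HUGO code base.

```python
from dataclasses import dataclass

# Hypothetical required features and plausible ranges (illustrative only).
REQUIRED_FIELDS = ("gas_temperature_C", "gas_pressure_MPa")
PLAUSIBLE_RANGES = {
    "gas_temperature_C": (0.0, 1200.0),
    "gas_pressure_MPa": (0.1, 10.0),
}


@dataclass
class ExtractedRecord:
    values: dict      # feature name -> extracted numeric value (or None)
    confidence: dict  # feature name -> assumed LLM confidence in [0, 1]


def risk_reasons(record, min_confidence=0.8):
    """Return the reasons a record is risky, checking coarse levels first."""
    # Level 1: required values that were not extracted at all.
    reasons = [f"missing:{name}" for name in REQUIRED_FIELDS
               if record.values.get(name) is None]
    if reasons:
        return reasons

    # Level 2: values outside an assumed physically plausible range.
    for name, (low, high) in PLAUSIBLE_RANGES.items():
        if not low <= record.values[name] <= high:
            reasons.append(f"out_of_range:{name}")
    if reasons:
        return reasons

    # Level 3: values the LLM itself reported low confidence in.
    for name in REQUIRED_FIELDS:
        if record.confidence.get(name, 0.0) < min_confidence:
            reasons.append(f"low_confidence:{name}")
    return reasons


def route(record, min_confidence=0.8):
    """Send risky records to manual review; keep the rest auto-labeled."""
    if risk_reasons(record, min_confidence):
        return "manual_review"
    return "auto_labeled"
```

Checking the cheapest and most decisive conditions first means a record with a missing value is flagged without evaluating the later levels, and the returned reasons tell a reviewer where to look.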
Problem

Research questions and friction points this paper is trying to address.

cold spray
process optimization
data standardization
machine-readable data
literature-derived dataset
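As a rough illustration of the data-standardization problem, the sketch below normalizes pressures and temperatures reported in mixed units. The choice of MPa and degrees Celsius as target units, the set of accepted unit strings, and the function names are assumptions for this example; the abstract does not state which units or conversions HUGO uses.

```python
# Hypothetical conversion factors to MPa (illustrative subset of units).
PRESSURE_TO_MPA = {
    "mpa": 1.0,
    "bar": 0.1,
    "kpa": 0.001,
    "psi": 0.006894757,
}


def pressure_to_mpa(value, unit):
    """Convert a pressure reported in a supported unit to MPa."""
    key = unit.strip().lower()
    if key not in PRESSURE_TO_MPA:
        raise ValueError(f"unsupported pressure unit: {unit!r}")
    return value * PRESSURE_TO_MPA[key]


def temperature_to_celsius(value, unit):
    """Convert a temperature reported in C, K, or F to degrees Celsius."""
    key = unit.strip().lower().replace("°", "")
    if key == "c":
        return value
    if key == "k":
        return value - 273.15
    if key == "f":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unsupported temperature unit: {unit!r}")
```

Raising on an unknown unit, rather than passing the value through, keeps unconverted numbers from silently entering a column that is supposed to be uniform.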
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid labeling
Uncertainty-aware extraction
Hierarchical Risk Mitigation
Cold spray dataset
LLM-assisted data curation
Stephen Price
Los Alamos National Laboratory
glaciology, ice sheet modeling, climate change, climate modeling
Kyle Miller
Citrine Informatics, Redwood City, CA, USA
Marco Musto
Citrine Informatics, Redwood City, CA, USA
Kenneth Kroenlein
Citrine Informatics, Redwood City, CA, USA
James Saal
Citrine Informatics, Redwood City, CA, USA
Kyle Tsaknopoulos
Materials and Manufacturing Department, Worcester Polytechnic Institute, Worcester, MA, USA
Elke A. Rundensteiner
Data Science Department, Worcester Polytechnic Institute, Worcester, MA, USA
Danielle L. Cote
Materials and Manufacturing Department, Worcester Polytechnic Institute, Worcester, MA, USA