🤖 AI Summary
In self-driving laboratories (SDLs), errors in capturing input parameters introduce feature noise that compromises model reliability and experimental reproducibility. To address this, we propose a model-agnostic framework for detecting and recovering noisy features that combines k-nearest neighbors (kNN) imputation with a systematic sensitivity analysis. Our approach quantifies how dataset size, noise intensity, and feature distribution, characterized by continuity (continuous vs. discrete) and dispersion (broad vs. narrow), affect detection rates and recovery accuracy. The results provide a benchmark of kNN imputation on materials datasets, stratified by feature type: higher noise intensity and larger training datasets improve detection and correction; continuous and broadly dispersed features are more recoverable than discrete or narrow ones; and low-intensity noise, which is harder to detect and recover, can be compensated for by larger clean training datasets. The framework aims to improve data quality and modeling robustness in SDLs, providing a principled foundation for reliable data governance in automated materials discovery.
📝 Abstract
Self-driving laboratories (SDLs) have shown promise in accelerating materials discovery by integrating machine learning with automated experimental platforms. However, errors in the capture of input parameters may corrupt the features used to model system performance, compromising current and future campaigns. This study develops an automated workflow to systematically detect noisy features, determine which sample-feature pairings can be corrected, and finally recover the correct feature values. A systematic study is then performed to examine how dataset size, noise intensity, and feature value distribution affect both the detectability and recoverability of noisy features. In general, high-intensity noise and large training datasets are conducive to the detection and correction of noisy features. Low-intensity noise reduces detection and recovery, but this can be compensated for by larger clean training datasets. Detection and correction results vary between features, with continuous and dispersed feature distributions showing greater recoverability than discrete or narrow ones. This systematic study not only demonstrates a model-agnostic framework for rational data recovery in the presence of noise, limited data, and differing feature distributions but also provides a tangible benchmark of kNN imputation on materials datasets. Ultimately, it aims to enhance data quality and experimental precision in automated materials discovery.
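To make the detect-then-recover idea concrete, here is a minimal sketch in plain Python. It is not the authors' workflow, whose details the abstract does not give. It assumes that a single suspect feature `j` is being checked, that all other features and the clean training rows are trustworthy, that features are already on comparable scales, and that a fixed tolerance `tol` separates noise from normal variation. The function names, the toy data, and the threshold are all hypothetical.

```python
def knn_estimate(clean_rows, sample, j, k=3):
    """Estimate feature j of `sample` as the mean of that feature over the
    k clean rows closest to the sample on all *other* features."""
    def sq_dist(row):
        # Squared Euclidean distance, skipping the suspect feature j.
        return sum((row[i] - sample[i]) ** 2
                   for i in range(len(sample)) if i != j)

    neighbors = sorted(clean_rows, key=sq_dist)[:k]
    return sum(row[j] for row in neighbors) / len(neighbors)


def detect_and_recover(clean_rows, samples, j, tol, k=3):
    """Flag samples whose recorded feature j deviates from its kNN estimate
    by more than `tol` (an assumed threshold) and replace the value."""
    flagged, repaired = [], []
    for idx, sample in enumerate(samples):
        estimate = knn_estimate(clean_rows, sample, j, k)
        fixed = list(sample)
        if abs(sample[j] - estimate) > tol:
            flagged.append(idx)
            fixed[j] = estimate
        repaired.append(fixed)
    return flagged, repaired


# Toy clean training set: three perfectly correlated features (hypothetical).
clean = [[t, 2 * t, 3 * t] for t in range(20)]

recorded = [
    [5, 10, 15],   # no noise
    [10, 20, 90],  # high-intensity noise in feature 2 (true value 30)
    [12, 24, 37],  # low-intensity noise in feature 2 (true value 36)
]

flagged, repaired = detect_and_recover(clean, recorded, j=2, tol=5.0)
print(flagged)      # indices of samples flagged as noisy
print(repaired[1])  # the repaired high-noise sample
```

In this toy case the high-intensity error is flagged and repaired, while the low-intensity error falls below the tolerance and goes undetected, which mirrors the qualitative trend described in the abstract. For real datasets, scikit-learn's `KNNImputer` offers a maintained kNN imputation implementation that could take the place of `knn_estimate`.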