🤖 AI Summary
Machine learning for electronic structure prediction conventionally assumes that high accuracy and generalizability require large, exhaustive datasets, a premise challenged by the previously unexplored redundancy in such data. Method: We systematically compare random, coverage-driven, and importance-based pruning on diverse Kohn–Sham density functional theory datasets (molecules, simple metals, and complex alloys) to quantify intrinsic redundancy and identify highly representative subsets. Contribution/Results: We demonstrate that a coverage-selected subset comprising as little as 1% of the original data achieves chemical accuracy (<0.01 eV/atom) and robust generalization across material classes, while training time is reduced threefold or more. This work provides the first quantitative characterization and systematic exploitation of redundancy in electronic structure data, introducing a “small-but-informative” data paradigm. It establishes a scalable, computationally efficient foundation for first-principles machine learning models without sacrificing predictive fidelity.
📝 Abstract
Machine learning (ML) models for electronic structure rely on large datasets generated through expensive Kohn-Sham density functional theory simulations. This study reveals a surprisingly high level of redundancy in such datasets across various material systems, including molecules, simple metals, and complex alloys. Our findings challenge the prevailing assumption that large, exhaustive datasets are necessary for accurate ML predictions of electronic structure. We demonstrate that even random pruning can substantially reduce dataset size with minimal loss in predictive accuracy, while a state-of-the-art coverage-based pruning strategy retains chemical accuracy and model generalizability using up to 100-fold less data and reduces training time threefold or more. By contrast, widely used importance-based pruning methods, which eliminate seemingly redundant data, can fail catastrophically at higher pruning factors, possibly because they sharply reduce data coverage. This previously unexplored high degree of redundancy in electronic structure data could make it possible to identify a minimal, essential dataset representative of each material class.
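To make the contrast between random and coverage-based pruning concrete, the following is a minimal sketch on synthetic data. It is not the pruning strategy used in the study, which the abstract does not specify: greedy k-center (farthest-point) selection is swapped in here as one common way to maximize coverage. The function names, the clustered synthetic features, and the `coverage_radius` metric are all assumptions made for illustration.

```python
import numpy as np


def random_prune(n_samples, keep, rng):
    """Keep `keep` sample indices chosen uniformly at random, without replacement."""
    return rng.choice(n_samples, size=keep, replace=False)


def coverage_prune(features, keep, start=0):
    """Greedy k-center (farthest-point) selection.

    Illustrative stand-in for a coverage-based strategy: repeatedly keep the
    sample that lies farthest from everything kept so far.
    """
    selected = [start]
    # Distance from every sample to its nearest kept sample.
    min_dist = np.linalg.norm(features - features[start], axis=1)
    while len(selected) < keep:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return np.array(selected)


def coverage_radius(features, subset):
    """Largest distance from any sample to its nearest kept sample (smaller is better)."""
    diffs = features[:, None, :] - features[subset][None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1).max()


rng = np.random.default_rng(0)

# Hypothetical redundant dataset: 2000 samples drawn around 20 well-separated
# cluster centers in an 8-dimensional feature space.
centers = rng.normal(size=(20, 8)) * 5.0
labels = rng.integers(0, 20, size=2000)
features = centers[labels] + rng.normal(scale=0.3, size=(2000, 8))

keep = 20  # 100-fold pruning: keep 1% of the samples
rand_idx = random_prune(len(features), keep, rng)
cov_idx = coverage_prune(features, keep)

print("random pruning coverage radius:  ", coverage_radius(features, rand_idx))
print("coverage pruning coverage radius:", coverage_radius(features, cov_idx))
```

In this toy setting, a random 1% subset almost certainly misses some clusters entirely, whereas farthest-point selection tends to pick one sample per cluster, so its coverage radius is much smaller. How this carries over to real electronic structure descriptors depends on the feature representation and is not something this sketch can establish.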