Surprisingly High Redundancy in Electronic Structure Data

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional machine learning for electronic structure prediction assumes that high accuracy and generalizability require large, exhaustive datasets, a premise that leaves the redundancy of such data unexamined. Method: We systematically compare random pruning, coverage-driven pruning, and importance-based pruning on diverse Kohn–Sham density functional theory datasets spanning molecules, simple metals, and complex alloys, quantifying intrinsic redundancy and identifying highly representative subsets. Contribution/Results: A subset comprising only 1% of the original data achieves chemical accuracy (<0.01 eV/atom) and robust generalization across material classes, while cutting training time by more than a factor of three. This work provides the first quantitative characterization and systematic exploitation of redundancy in electronic structure data, introducing a “small-but-informative” data paradigm and establishing a scalable, computationally efficient foundation for first-principles machine learning without sacrificing predictive fidelity.
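To make the pruning strategies concrete, here is a minimal sketch, assuming each DFT training configuration is summarized by a fixed-length descriptor vector. Greedy farthest-point sampling is used as a stand-in for coverage-driven pruning; the function names and data are illustrative, not taken from the paper's code.

```python
# Minimal sketch of two pruning baselines named in the summary, assuming each
# DFT training configuration is summarized by a fixed-length descriptor vector.
# Farthest-point sampling is one common way to realize coverage-driven pruning;
# it is illustrative, not necessarily the paper's exact method.
import numpy as np

def random_prune(n_samples: int, keep: int, seed: int = 0) -> np.ndarray:
    """Return indices of a uniformly random subset."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_samples, size=keep, replace=False)

def coverage_prune(X: np.ndarray, keep: int) -> np.ndarray:
    """Greedy farthest-point sampling: each new pick maximizes the distance
    to the closest already-selected point, spreading the kept subset over
    the whole descriptor space."""
    selected = [0]  # arbitrary starting point
    d_min = np.linalg.norm(X - X[0], axis=1)  # distance to the selected set
    for _ in range(keep - 1):
        nxt = int(np.argmax(d_min))
        selected.append(nxt)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[nxt], axis=1))
    return np.asarray(selected)

# Keep 1% of 10,000 hypothetical configurations with 64-dim descriptors.
X = np.random.default_rng(1).random((10_000, 64))
idx_random = random_prune(len(X), keep=100)
idx_coverage = coverage_prune(X, keep=100)
```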

📝 Abstract
Machine Learning (ML) models for electronic structure rely on large datasets generated through expensive Kohn–Sham density functional theory simulations. This study reveals a surprisingly high level of redundancy in such datasets across various material systems, including molecules, simple metals, and complex alloys. Our findings challenge the prevailing assumption that large, exhaustive datasets are necessary for accurate ML predictions of electronic structure. We demonstrate that even random pruning can substantially reduce dataset size with minimal loss in predictive accuracy, while a state-of-the-art coverage-based pruning strategy retains chemical accuracy and model generalizability using up to 100-fold less data and reducing training time by threefold or more. By contrast, widely used importance-based pruning methods, which eliminate seemingly redundant data, can catastrophically fail at higher pruning factors, possibly due to the significant reduction in data coverage. This heretofore unexplored high degree of redundancy in electronic structure data holds the potential to identify a minimal, essential dataset representative of each material class.
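The failure mode flagged for importance-based pruning can be seen in a toy experiment. In the sketch below (synthetic data, hypothetical names), per-sample "importance" is a proxy error that happens to concentrate in one corner of descriptor space; keeping only the hardest samples then leaves most of the space uncovered, and the coverage gap widens as the pruning factor grows, consistent with the collapse described in the abstract.

```python
# Toy illustration of how importance-based pruning can lose coverage: when
# proxy errors concentrate in one region of descriptor space, keeping only
# high-error samples abandons the rest. All names and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1_000, 8))                              # synthetic descriptors
errors = X[:, 0] + 0.1 * rng.standard_normal(len(X))    # "hard" samples cluster

def importance_prune(errors: np.ndarray, keep: int) -> np.ndarray:
    """Keep the `keep` samples with the largest proxy error."""
    return np.argsort(errors)[-keep:]

def coverage_radius(X: np.ndarray, idx: np.ndarray) -> float:
    """Worst-case distance from any sample to its nearest kept sample;
    larger values mean poorer coverage of the full dataset."""
    d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=-1)
    return float(d.min(axis=1).max())

for keep in (500, 100, 10):  # increasing pruning factor
    idx_imp = importance_prune(errors, keep)
    idx_rnd = rng.choice(len(X), size=keep, replace=False)
    print(f"keep={keep:4d}  importance={coverage_radius(X, idx_imp):.2f}"
          f"  random={coverage_radius(X, idx_rnd):.2f}")
```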
Problem

Research questions and friction points this paper is trying to address.

How much redundancy do Kohn–Sham DFT datasets for ML contain across molecules, simple metals, and complex alloys?
Can heavily pruned subsets, down to 1% of the original data, still deliver chemical accuracy and generalizable models?
Which pruning strategy (random, coverage-driven, or importance-based) remains reliable at high pruning factors?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic head-to-head comparison of random, coverage-driven, and importance-based pruning on Kohn–Sham DFT datasets
Coverage-based pruning retains chemical accuracy and generalizability with up to 100-fold less data and over threefold faster training
First quantitative characterization of redundancy in electronic structure data, motivating a “small-but-informative” data paradigm (see the accuracy-check sketch after this list)
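As a concrete reading of the accuracy claim above, the snippet below sketches the acceptance test a pruned subset must pass: test-set mean absolute energy error below chemical accuracy, taken as 0.01 eV/atom per the summary. The function and example numbers are placeholders, not the paper's code.

```python
# Minimal acceptance check implied by the summary: a model trained on a
# pruned subset "passes" if its test-set mean absolute energy error stays
# below chemical accuracy (0.01 eV/atom, per the summary). Placeholder code.
import numpy as np

CHEMICAL_ACCURACY = 0.01  # eV/atom, threshold quoted in the summary

def passes_chemical_accuracy(e_pred: np.ndarray,
                             e_ref: np.ndarray,
                             n_atoms: np.ndarray) -> bool:
    """Mean absolute error of predicted total energies, normalized per atom."""
    mae_per_atom = np.mean(np.abs(e_pred - e_ref) / n_atoms)
    return bool(mae_per_atom < CHEMICAL_ACCURACY)

# Three hypothetical test structures: predicted vs. reference energies (eV)
# and atom counts.
print(passes_chemical_accuracy(np.array([-310.42, -155.18, -620.90]),
                               np.array([-310.45, -155.20, -620.88]),
                               np.array([64, 32, 128])))  # -> True
```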
👥 Authors

Sazzad Hossain
Samarkand State University, Samarkand, Uzbekistan
Quantum computing · IoT · Artificial Intelligence · Information Security · Expert Systems

Ponkrshnan Thiagarajan
Johns Hopkins University
Uncertainty quantification · Bayesian methods · Machine learning · Computational Mechanics

Shashank Pathrudnar
Department of Mechanical and Aerospace Engineering, Michigan Technological University

Stephanie Taylor
Department of Materials Science and Engineering, University of California, Los Angeles

Abhijeet Sadashiv Gangan
Department of Materials Science and Engineering, University of California, Los Angeles

Amartya S. Banerjee
Department of Materials Science and Engineering, University of California, Los Angeles

Susanta Ghosh
Assistant Professor, Mechanical and Aerospace Engineering, Michigan Technological University
Scientific Machine Learning · Bayesian approaches in Machine Learning · Multiscale modeling