Surprisingly High Redundancy in Electronic Structure Data

📅 2025-07-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Machine learning for electronic structure prediction conventionally assumes that high accuracy and generalizability require large, exhaustive datasets, a premise that overlooks possible redundancy in the data. Method: We systematically compare random pruning, coverage-based pruning, and importance-based pruning on diverse Kohn–Sham density functional theory datasets, including molecules, simple metals, and complex alloys, to quantify intrinsic redundancy and identify highly representative subsets. Contribution/Results: With coverage-based pruning, a subset as small as 1% of the original data (up to 100-fold less) retains chemical accuracy (<0.01 eV/atom) and generalization across material classes, and training time falls by threefold or more. Even random pruning substantially shrinks the dataset with minimal accuracy loss, whereas importance-based pruning can fail catastrophically at higher pruning factors, possibly because it sharply reduces data coverage. This work characterizes and exploits a previously unexplored redundancy in electronic structure data, pointing toward a “small-but-informative” data paradigm and a minimal, essential dataset for each material class.

📝 Abstract

Machine Learning (ML) models for electronic structure rely on large datasets generated through expensive Kohn-Sham Density Functional Theory simulations. This study reveals a surprisingly high level of redundancy in such datasets across various material systems, including molecules, simple metals, and complex alloys. Our findings challenge the prevailing assumption that large, exhaustive datasets are necessary for accurate ML predictions of electronic structure. We demonstrate that even random pruning can substantially reduce dataset size with minimal loss in predictive accuracy, while a state-of-the-art coverage-based pruning strategy retains chemical accuracy and model generalizability using up to 100-fold less data and reducing training time by threefold or more. By contrast, widely used importance-based pruning methods, which eliminate seemingly redundant data, can catastrophically fail at higher pruning factors, possibly due to the significant reduction in data coverage. This heretofore unexplored high degree of redundancy in electronic structure data holds the potential to identify a minimal, essential dataset representative of each material class.
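The contrast between random and coverage-based pruning can be illustrated with a minimal sketch. The abstract does not name the specific coverage-based algorithm, so this example assumes greedy k-center selection (farthest-point sampling) as a stand-in, and uses toy 2D points in place of real electronic structure descriptors. All function names and the synthetic data are hypothetical.

```python
import math
import random


def random_prune(points, keep, seed=0):
    """Keep `keep` indices chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(len(points)), keep)


def k_center_greedy(points, keep, first=0):
    """Greedy k-center: repeatedly add the point farthest from the kept set.

    Assumption: a stand-in for coverage-based pruning, not the paper's method.
    """
    selected = [first]
    # Distance from every point to its nearest kept point so far.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(selected) < keep:
        idx = max(range(len(points)), key=nearest.__getitem__)
        selected.append(idx)
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return selected


def coverage_radius(points, selected):
    """Largest distance from any point to its nearest kept point (smaller is better)."""
    return max(min(math.dist(p, points[s]) for s in selected) for p in points)


# Toy "descriptors": a dense, highly redundant cluster plus a few rare configurations.
rng = random.Random(42)
cluster = [(rng.gauss(0.0, 0.05), rng.gauss(0.0, 0.05)) for _ in range(995)]
outliers = [(3.0, 0.0), (0.0, 3.0), (-3.0, 0.0), (0.0, -3.0), (3.0, 3.0)]
data = cluster + outliers

keep = 10  # 100-fold pruning of 1000 samples
rand_idx = random_prune(data, keep)
cov_idx = k_center_greedy(data, keep)

print("random   coverage radius:", round(coverage_radius(data, rand_idx), 3))
print("k-center coverage radius:", round(coverage_radius(data, cov_idx), 3))
```

In this toy setting, the greedy selection keeps every rare configuration, while a uniform random draw mostly resamples the dense cluster. This is only an illustration of why coverage can matter at high pruning factors. It does not reproduce the paper's experiments or accuracy figures.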

Problem

Research questions and friction points this paper is trying to address.

Whether large, exhaustive Kohn-Sham DFT datasets are actually necessary for accurate ML prediction of electronic structure

How far random pruning can shrink a dataset before predictive accuracy degrades

Whether coverage-based pruning can retain chemical accuracy and generalizability with far less data

Innovation

Methods, ideas, or system contributions that make the work stand out.

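Random pruning alone substantially reduces dataset size with minimal loss in predictive accuracy

Coverage-based pruning retains chemical accuracy and generalizability with up to 100-fold less data and threefold or more faster training

Importance-based pruning can fail catastrophically at higher pruning factors, possibly because data coverage drops sharply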
