🤖 AI Summary
Polymer science has long suffered from a scarcity of large-scale, open-access data, hindering AI-driven innovation. Method: We introduce PolyOmics, the largest publicly available polymer molecular dynamics simulation database to date (>100,000 polymers), generated via a fully automated high-throughput simulation pipeline and used within a pretrain-then-fine-tune machine learning framework. We propose and empirically validate a "simulation-to-reality" transfer learning paradigm for polymer property prediction. Contribution/Results: Systematic experiments reveal a power-law scaling relationship between database size and model generalization performance, providing empirical support for data-driven scientific discovery. Pretraining on PolyOmics significantly improves prediction accuracy in low-data regimes, enabling robust property estimation from limited experimental samples. This advance helps bridge the gap between academic AI research and industrial polymer development, facilitating rapid, data-informed materials design and accelerating translation into real-world applications.
📝 Abstract
Developing large-scale foundational datasets is a critical milestone in advancing artificial intelligence (AI)-driven scientific innovation. However, unlike AI-mature fields such as natural language processing, materials science, particularly polymer research, has significantly lagged in developing extensive open datasets. This lag is primarily due to the high costs of polymer synthesis and property measurements, along with the vastness and complexity of the chemical space. This study presents PolyOmics, an omics-scale computational database generated through fully automated molecular dynamics simulation pipelines that provide diverse physical properties for over $10^5$ polymeric materials. The PolyOmics database is collaboratively developed by approximately 260 researchers from 48 institutions to bridge the gap between academia and industry. Machine learning models pretrained on PolyOmics can be efficiently fine-tuned for a wide range of real-world downstream tasks, even when only limited experimental data are available. Notably, the generalisation capability of these simulation-to-real transfer models improves significantly as the size of the PolyOmics database increases, exhibiting power-law scaling. The emergence of scaling laws supports the "more is better" principle, highlighting the significance of ultralarge-scale computational materials data for improving real-world prediction performance. This unprecedented omics-scale database reveals vast unexplored regions of polymer materials, providing a foundation for AI-driven polymer science.
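To make the power-law scaling claim concrete, the sketch below shows one common way such a relationship could be checked: fit a straight line to log(error) versus log(database size). The functional form `error = a * n**(-alpha)` is an assumption here (the abstract does not state the exact form, and scaling-law fits often include an additional irreducible-error offset), the function name `fit_power_law` is hypothetical, and the numbers are synthetic values chosen only for illustration, not results from PolyOmics.

```python
import math


def fit_power_law(sizes, errors):
    """Fit errors ~ a * sizes**(-alpha) by least squares in log-log space.

    Assumed form (not taken from the paper): no irreducible-error offset.
    Returns the pair (a, alpha).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope and intercept of log(error) vs log(size).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, illustrative data only: pretraining-set sizes and the
# downstream test errors an exact power law would produce.
sizes = [1e2, 1e3, 1e4, 1e5]
errors = [2.0 * n ** -0.25 for n in sizes]

a, alpha = fit_power_law(sizes, errors)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

A positive fitted `alpha` means the error keeps shrinking as the pretraining database grows, which is the sense in which a scaling law supports the "more is better" principle. With real measurements, the points would scatter around the fitted line rather than fall exactly on it.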