We Need Improved Data Curation and Attribution in AI for Scientific Discovery

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
In AI-driven scientific discovery, synthetic data is proliferating while real experimental data is becoming harder to trace, which puts data integrity and model stability at risk. We find that 74% of publicly available experimental data on open platforms has low adoption rates, and that the boundary between real and synthetic data is increasingly blurred. Method: We propose “active watermarking”, a shift from passively detecting synthetic data to proactively annotating real data, and estimate that watermarking less than 50% of real data could help sustain AI model robustness. The proposed framework combines watermark embedding, synthetic-data identification, data adoption-rate modeling, and lineage tracing. Contribution/Results: Our estimates suggest that active watermarking can improve model resilience against synthetic-data contamination while enabling verifiable attribution. The work argues for trustworthy, auditable, and attributable data governance in scientific AI, in support of reproducibility and accountability in data-intensive research.

📝 Abstract
As the interplay between human-generated and synthetic data evolves, new challenges arise in scientific discovery concerning the integrity of the data and the stability of the models. In this work, we examine the role of synthetic data as opposed to that of real experimental data for scientific research. Our analyses indicate that nearly three-quarters of experimental datasets available on open-access platforms have relatively low adoption rates, opening new opportunities to enhance their discoverability and usability by automated methods. Additionally, we observe an increasing difficulty in distinguishing synthetic from real experimental data. We propose supplementing ongoing efforts in automating synthetic data detection by increasing the focus on watermarking real experimental data, thereby strengthening data traceability and integrity. Our estimates suggest that watermarking even less than half of the real-world data generated annually could help sustain model robustness, while promoting a balanced integration of synthetic and human-generated content.
Problem

Research questions and friction points this paper is trying to address.

Enhancing data integrity in AI-driven scientific discovery
Improving discoverability of low-adoption experimental datasets
Strengthening traceability by watermarking real experimental data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhancing data discoverability via automated methods
Watermarking real data to ensure traceability
Balancing synthetic and real data integration
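
The idea of marking only real experimental data, at partial coverage, can be illustrated with a small sketch. This is not the paper's watermarking scheme: it substitutes a keyed HMAC provenance tag stored next to each record for an embedded watermark, and every name here (`SECRET_KEY`, `tag_record`, `is_verified_real`, `publish`, the record fields, the 0.4 coverage value) is a hypothetical chosen for illustration.

```python
import hashlib
import hmac
import json
import random

# Hypothetical signing key held by the lab that produces the real data.
SECRET_KEY = b"lab-signing-key"


def canonical_bytes(record: dict) -> bytes:
    """Serialize a record deterministically so the tag is reproducible."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def tag_record(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Attach a keyed provenance tag (HMAC-SHA256) to a real record."""
    tag = hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()
    return {"data": record, "provenance_tag": tag}


def is_verified_real(entry: dict, key: bytes = SECRET_KEY) -> bool:
    """Return True only if the entry carries a tag that matches its data."""
    tag = entry.get("provenance_tag")
    if tag is None:
        return False
    expected = hmac.new(key, canonical_bytes(entry["data"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)


def publish(records: list, coverage: float, rng: random.Random) -> list:
    """Tag each real record with probability `coverage`; leave the rest untagged."""
    return [
        tag_record(r) if rng.random() < coverage else {"data": r}
        for r in records
    ]


# Toy pool: real measurements tagged at partial coverage, plus untagged synthetic rows.
real = [{"sample_id": i, "yield_pct": 50 + i} for i in range(10)]
synthetic = [{"data": {"sample_id": 100 + i, "yield_pct": 70.0}} for i in range(10)]
pool = publish(real, coverage=0.4, rng=random.Random(0)) + synthetic

# A downstream consumer can recover a subset that is verifiably real.
verified = [e for e in pool if is_verified_real(e)]
```

In this sketch an untagged entry is simply of unknown origin, while a tagged entry whose tag verifies is attributable to the key holder. That asymmetry is why partial coverage can still be useful: a training pipeline only needs a verifiable anchor set of real data, not a label on every record.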

Authors

Mara Graziani
IBM Research Europe, Zürich, Switzerland; NCCR Catalysis, Switzerland
Antonio Foncubierta
IBM Research Europe, Zürich, Switzerland
Dimitrios Christofidellis
IBM Research Europe, Zürich, Switzerland
Irina Espejo-Morales
IBM Research Europe, Zürich, Switzerland
Malina Molnar
IBM Research Europe, Zürich, Switzerland; NCCR Catalysis, Switzerland
Marvin Alberts
IBM Research
ML for Chemistry, Analytical Chemistry, Accelerated Discovery
Matteo Manica
IBM Research
Accelerated Discovery, Artificial Intelligence, Machine learning, Deep learning
Jannis Born
IBM Research
AI 4 Science, Language Models, Quantum ML, Machine Learning