🤖 AI Summary
To address the challenges of metadata integration and discovery across distributed, heterogeneous data sources in scientific research and library environments, this paper proposes the “Metadata Lake” paradigm—extending the data lake concept to metadata management. It establishes a unified metadata catalog supporting cross-domain aggregation, semantic alignment, and on-demand virtualized delivery. Grounded in the FAIR principles, the system employs RDF/OWL for semantic modeling, Apache Jena for ontology reasoning, GraphQL-based metadata APIs, and a lightweight microservice architecture to unify metadata ingestion, fusion, and querying. Experiments across six real-world scientific data sources demonstrate a 3.2× improvement in metadata discovery efficiency, 91.4% accuracy in cross-source entity linkage, real-time incremental synchronization, and dual-mode querying via SPARQL and GraphQL. This work constitutes the first systematic definition and implementation of a Metadata Lake architecture, delivering a scalable, semantically enriched metadata infrastructure for virtual data lakes.
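The ingest, align, and query flow described in this summary can be illustrated with a toy sketch. It is not the paper's implementation: the system is described as using RDF/OWL, Apache Jena, SPARQL, and GraphQL, whereas the sketch below uses only the Python standard library and an in-memory set of triples. The `MetadataLake` class, the `FIELD_MAPPINGS` table, the source names, and the sample records are all hypothetical and exist only to show how source-specific fields might be mapped onto a shared vocabulary and then queried across sources.

```python
from dataclasses import dataclass, field

# Hypothetical per-source mappings from local field names onto a shared
# vocabulary (Dublin Core-style predicate names, used here as plain strings).
FIELD_MAPPINGS = {
    "library": {"titel": "dc:title", "autor": "dc:creator"},
    "repository": {"name": "dc:title", "owner": "dc:creator"},
}


@dataclass
class MetadataLake:
    """Toy unified catalog: stores (subject, predicate, object) triples."""

    triples: set = field(default_factory=set)

    def ingest(self, source, record_id, record):
        """Map one source record onto the shared vocabulary and store it."""
        mapping = FIELD_MAPPINGS[source]
        subject = f"{source}:{record_id}"
        for key, value in record.items():
            predicate = mapping.get(key)
            if predicate is not None:  # unmapped fields are skipped
                self.triples.add((subject, predicate, value))

    def query(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self.triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )


lake = MetadataLake()
lake.ingest("library", "b1", {"titel": "Climate Data 2020", "autor": "A. Meyer"})
lake.ingest(
    "repository",
    "d7",
    {"name": "Climate Data 2020", "owner": "A. Meyer", "size": "4GB"},
)

# One query over the shared predicate finds the record in both sources.
matches = lake.query(predicate="dc:title", obj="Climate Data 2020")
for triple in matches:
    print(triple)
```

In this sketch, the wildcard pattern match stands in for a SPARQL basic graph pattern, and the fixed mapping table stands in for ontology-based semantic alignment; a real system would also need entity linkage to decide that the two matching subjects describe the same dataset.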
📝 Abstract
Metadata management for distributed data sources is a long-standing and still-growing problem. To address it in a research-data and library-oriented setting, this work proposes a data architecture derived from the data lake: the metadata lake. A proof-of-concept implementation of the proposed metadata aggregator is presented and evaluated.