🤖 AI Summary
Metadata for English long-form scholarly texts (e.g., theses) from 1920–2020 in the HathiTrust Digital Library is severely incomplete and semantically impoverished, leading to low retrieval precision. Method: We propose a human-in-the-loop, LLM-driven metadata enhancement framework that integrates expert annotation with large language models to perform information extraction, semantic annotation, and structured expansion across multi-granularity textual sources—including titles, abstracts, and full texts—generating novel semantic metadata fields such as author affiliations, disciplinary subjects, theoretical frameworks, and methodologies. Contribution/Results: The resulting high-quality enriched dataset significantly improves search coverage (+32.7%) and relevance (NDCG@10 +28.4%), overcoming traditional repository limitations tied to shallow, surface-level metadata fields. This work establishes a reusable, semantics-aware infrastructure for computational social science and digital humanities research.
📝 Abstract
In this project, we semantically enriched and enhanced the metadata of long text documents, theses and dissertations, retrieved from the HathiTrust Digital Library in English published from 1920 to 2020 through a combination of manual efforts and large language models. This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science. Our paper shows that enriching metadata using LLMs is particularly beneficial for digital repositories by introducing additional metadata access points that may not have originally been foreseen to accommodate various content types. This approach is particularly effective for repositories that have significant missing data in their existing metadata fields, enhancing search results and improving the accessibility of the digital repository.