🤖 AI Summary
Inaccurate bibliographic metadata extraction stems from the high heterogeneity of web layouts and data formats across scholarly publishers. Method: This paper proposes CRAWLDoc, a context-aware approach for jointly ranking linked web documents: starting from an identifier such as a DOI, it retrieves the landing page and its linked resources (e.g., PDFs, ORCID profiles, supplementary materials), then embeds their content together with anchor texts and URLs into a unified representation. Contribution/Results: The authors introduce a new cross-publisher benchmark of 600 manually labeled publications from six top computer science publishers. Their layout-independent, format-robust ranking framework overcomes the limitations of single-page parsing, and experiments demonstrate robust ranking of relevant documents across diverse publishers and data formats. This work lays the groundwork for automated metadata extraction from heterogeneous scholarly web pages; both code and dataset are publicly released.
📝 Abstract
Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting with a publication's URL, such as a digital object identifier, CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, along with their anchor texts and URLs, into a unified representation. To evaluate CRAWLDoc, we created a new, manually labeled dataset of 600 publications from six top publishers in computer science. CRAWLDoc demonstrates robust, layout-independent ranking of relevant documents across publishers and data formats, laying the foundation for improved metadata extraction from web documents with varied layouts and formats. Our source code and dataset can be accessed at https://github.com/FKarl/CRAWLDoc.
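The core pipeline described above (collect linked resources, embed each one together with its anchor text and URL, rank by relevance to the landing page) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the resource list, field names, and the bag-of-words cosine scoring are stand-ins for CRAWLDoc's learned unified embeddings.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; CRAWLDoc instead uses a learned
    # unified embedding of content, anchor text, and URL.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_linked_resources(landing_page_text: str, resources: list) -> list:
    """Rank linked resources by relevance to the landing page.

    Each resource dict combines its anchor text, URL, and content into
    one representation before scoring, mirroring the idea of embedding
    all three signals jointly.
    """
    query = embed(landing_page_text)
    scored = sorted(
        resources,
        key=lambda r: cosine(query, embed(f"{r['anchor']} {r['url']} {r['content']}")),
        reverse=True,
    )
    return [r["url"] for r in scored]
```

For example, given a landing page about metadata extraction, a linked PDF whose content overlaps that topic would rank above an unrelated privacy-policy link; the real system makes this distinction robustly across publisher layouts rather than via word overlap.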