🤖 AI Summary
Inaccurate bibliographic metadata extraction stems from the high heterogeneity of web layouts and data formats across scholarly publishers. Method: This paper proposes CRAWLDoc, a context-aware approach for jointly ranking linked web documents: starting from an identifier such as a DOI, it retrieves the landing page and its linked resources (e.g., PDFs, ORCID profiles, supplementary materials), then embeds their content together with anchor texts and URLs into a unified representation. Contribution/Results: The authors introduce a new cross-publisher benchmark of 600 manually labeled publications from six top computer science publishers. Their layout-independent, format-robust ranking framework overcomes the limitations of single-page parsing, and experiments demonstrate robust ranking of relevant documents across diverse publishers and data formats. This work lays the groundwork for automated metadata extraction from heterogeneous scholarly web pages; both code and dataset are publicly released.
📝 Abstract
Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting with a publication's URL, such as a digital object identifier, CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, along with their anchor texts and URLs, into a unified representation. To evaluate CRAWLDoc, we created a new, manually labeled dataset of 600 publications from six top publishers in computer science. CRAWLDoc demonstrates robust, layout-independent ranking of relevant documents across publishers and data formats, laying the foundation for improved metadata extraction from web documents with varied layouts and formats. Our source code and dataset can be accessed at https://github.com/FKarl/CRAWLDoc.
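The core pipeline described above (collect linked resources, embed each one together with its anchor text and URL, rank by relevance to the landing page) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the resource list, field names, and the bag-of-words cosine scoring are stand-ins for CRAWLDoc's learned unified embeddings.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; CRAWLDoc instead uses a learned
    # unified embedding of content, anchor text, and URL.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_linked_resources(landing_page_text: str, resources: list) -> list:
    """Rank linked resources by relevance to the landing page.

    Each resource dict combines its anchor text, URL, and content into
    one representation before scoring, mirroring the idea of embedding
    all three signals jointly.
    """
    query = embed(landing_page_text)
    scored = sorted(
        resources,
        key=lambda r: cosine(query, embed(f"{r['anchor']} {r['url']} {r['content']}")),
        reverse=True,
    )
    return [r["url"] for r in scored]
```

For example, given a landing page about metadata extraction, a linked PDF whose content overlaps that topic would rank above an unrelated privacy-policy link; the real system makes this distinction robustly across publisher layouts rather than via word overlap.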