AI Summary
This work addresses the limitations that large language model (LLM) agents face in scientific research, namely high token consumption, inefficient retrieval, and unstable evidence, all of which stem from unstructured literature formats such as PDF and HTML. To overcome these challenges, the paper introduces what it describes as the first agent-oriented data interface architecture for scientific literature, which automatically transforms raw documents into structured JSON and exposes them through a three-layer system for progressive access. The interface integrates a RESTful API, a Python SDK, a CLI, and the Model Context Protocol (MCP), supports daily synchronization with the full arXiv corpus, and is extensible to other platforms such as PubMed Central. The authors release an open-source toolchain and a web demonstration system, aiming to improve the efficiency, accuracy, and programmability of LLM agents that process scientific literature.
Abstract
Research agents are increasingly used in AI4Science for scientific information seeking and evidence-grounded decision making. Yet a persistent bottleneck is paper access: agents typically retrieve PDF/HTML pages, heuristically parse them, and ingest long unstructured text, leading to token-heavy reading and brittle evidence lookup. This motivates an agentic data interface for scientific papers that standardizes access, exposes budget-aware views, and treats grounding as a first-class operation. We introduce DeepXiv-SDK, which enables progressive access aligned with how agents allocate attention and reading budget. DeepXiv-SDK exposes three structured views: a header-first view for screening, a section-structured view for targeted navigation, and on-demand evidence-level access for verification. Each layer is augmented with enriched attributes and explicit budget hints, so agents can balance relevance, cost, and grounding before escalating to full-text processing. DeepXiv-SDK also supports multi-faceted retrieval and aggregation over paper attributes, enabling constraint-driven search and curation over paper sets. DeepXiv-SDK is currently deployed at arXiv scale with daily synchronization to new releases and is designed to extend to other open-access corpora (e.g., PubMed Central, bioRxiv). We release RESTful APIs, an open-source Python SDK, and a web demo showcasing deep search and deep research workflows; the service is free to use with registration.
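To make the progressive-access idea concrete, the following is a minimal, self-contained sketch of how an agent might escalate from a header view to a section outline to evidence lookup under a token budget. It is not the DeepXiv-SDK API: the `Paper` and `Section` classes, the method names, the word-count token estimate, and the `read_progressively` helper are all hypothetical, chosen only to illustrate the three layers and the budget hints described above.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    text: str

    @property
    def token_estimate(self) -> int:
        # Crude budget hint: whitespace-separated word count (an assumption).
        return len(self.text.split())


@dataclass
class Paper:
    # Hypothetical in-memory stand-in for a structured paper record.
    title: str
    abstract: str
    sections: list[Section] = field(default_factory=list)

    def header_view(self) -> dict:
        """Layer 1: cheap metadata for screening, plus a full-text budget hint."""
        return {
            "title": self.title,
            "abstract": self.abstract,
            "full_text_tokens": sum(s.token_estimate for s in self.sections),
        }

    def section_view(self) -> list[dict]:
        """Layer 2: section outline with per-section budget hints, no body text."""
        return [{"title": s.title, "tokens": s.token_estimate} for s in self.sections]

    def evidence(self, section_title: str, query: str) -> list[str]:
        """Layer 3: sentences from one section that mention the query term."""
        for s in self.sections:
            if s.title == section_title:
                return [
                    sent.strip()
                    for sent in s.text.split(".")
                    if query.lower() in sent.lower()
                ]
        return []


def read_progressively(paper: Paper, query: str, budget: int) -> list[str]:
    """Escalate from header to sections to evidence while staying within budget."""
    header = paper.header_view()
    if query.lower() not in (header["title"] + " " + header["abstract"]).lower():
        return []  # screened out at the header layer
    found, spent = [], 0
    for entry in paper.section_view():
        if spent + entry["tokens"] > budget:
            continue  # skip sections that would exceed the remaining budget
        spent += entry["tokens"]
        found.extend(paper.evidence(entry["title"], query))
    return found
```

In this sketch the agent pays for body text only after the header passes screening, and it uses the per-section token hints to decide which sections it can afford to open. A real client would presumably fetch each layer from the service rather than from an in-memory object.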