DeepXiv-SDK: An Agentic Data Interface for Scientific Papers

📅 2026-02-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations that large language model (LLM) agents face in scientific research: high token consumption, inefficient retrieval, and unstable evidence, all stemming from unstructured literature formats such as PDF and HTML. To overcome these challenges, the paper introduces the first agent-oriented data interface architecture for scientific literature, which automatically transforms raw documents into structured JSON and serves them through a three-layer system enabling multimodal access. The interface integrates a RESTful API, a Python SDK, a CLI, and MCP support; it synchronizes daily with the full arXiv corpus and is extensible to other platforms such as PubMed Central. The authors release an open-source toolchain and a web demonstration system, improving the efficiency, accuracy, and programmability of LLM agents that process scientific literature.

📝 Abstract
Research agents are increasingly used in AI4Science for scientific information seeking and evidence-grounded decision making. Yet a persistent bottleneck is paper access: agents typically retrieve PDF/HTML pages, heuristically parse them, and ingest long unstructured text, leading to token-heavy reading and brittle evidence lookup. This motivates an agentic data interface for scientific papers that standardizes access, exposes budget-aware views, and treats grounding as a first-class operation. We introduce DeepXiv-SDK, which enables progressive access aligned with how agents allocate attention and reading budget. DeepXiv-SDK exposes three structured views: a header-first view for screening, a section-structured view for targeted navigation, and on-demand evidence-level access for verification. Each layer is augmented with enriched attributes and explicit budget hints, so agents can balance relevance, cost, and grounding before escalating to full-text processing. DeepXiv-SDK also supports multi-faceted retrieval and aggregation over paper attributes, enabling constraint-driven search and curation over paper sets. DeepXiv-SDK is currently deployed at arXiv scale with daily synchronization to new releases and is designed to extend to other open-access corpora (e.g., PubMed Central, bioRxiv). We release RESTful APIs, an open-source Python SDK, and a web demo showcasing deep search and deep research workflows; the service is free to use with registration.
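The progressive access pattern described in the abstract (screen on the header, navigate by section, fetch evidence only on demand) can be illustrated with a small in-memory sketch. This is not the DeepXiv-SDK API: the class names, method names, the token_hint field, and the keyword-matching relevance test are all hypothetical, chosen only to show how budget hints might let an agent decide what to read before paying for full text.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    text: str
    token_hint: int  # hypothetical budget hint: approximate cost of reading this section


@dataclass
class Paper:
    paper_id: str
    title: str
    abstract: str
    sections: list[Section] = field(default_factory=list)

    def header_view(self) -> dict:
        """Cheapest view: metadata only, used for screening."""
        return {"id": self.paper_id, "title": self.title, "abstract": self.abstract}

    def section_view(self) -> list[dict]:
        """Outline with per-section budget hints and no body text."""
        return [{"title": s.title, "token_hint": s.token_hint} for s in self.sections]

    def evidence(self, section_title: str) -> str:
        """On-demand access to one section's text, used for verification."""
        for s in self.sections:
            if s.title == section_title:
                return s.text
        raise KeyError(section_title)


def read_progressively(paper: Paper, keywords: set[str], budget: int) -> list[str]:
    """Escalate from header to outline to evidence while staying within budget.

    Keywords are expected in lowercase; relevance is a naive substring match,
    standing in for whatever scoring a real agent would use.
    """
    header = paper.header_view()
    screen_text = f"{header['title']} {header['abstract']}".lower()
    if not any(k in screen_text for k in keywords):
        return []  # screened out at the header; no section text is read

    chosen = []
    for entry in paper.section_view():
        relevant = any(k in entry["title"].lower() for k in keywords)
        if relevant and entry["token_hint"] <= budget:
            budget -= entry["token_hint"]
            chosen.append(paper.evidence(entry["title"]))
    return chosen
```

The point of the design is that the two cheap views carry enough information (titles and cost hints) for the agent to skip a paper entirely or to skip sections it cannot afford, so full text is fetched only for sections that are both relevant and within budget.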
Problem

Research questions and friction points this paper is trying to address.

LLM agents
scientific literature
data access
unstructured data
token consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic data interface
structured scientific literature
LLM agents
DeepXiv-SDK
cost-efficient retrieval