Knows: Agent-Native Structured Research Representations

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Research documents are published predominantly in human-oriented formats such as PDF, which hinders efficient and accurate extraction of fine-grained information by large language model (LLM) agents. This work proposes Knows, a lightweight companion specification that binds structured claims, evidence, provenance, and verifiable relations to the original paper through YAML sidecar files (KnowsRecords), which agents can consume directly without any change to the source document. Sidecars are validated by a deterministic schema linter. In an evaluation of 140 comprehension questions across 20 papers and six LLM agents, weak models (0.8B–2B parameters) gain 29 to 42 percentage points in accuracy when reading the sidecar instead of the PDF, while consuming 29–86% fewer input tokens. A community sidecar hub that has indexed over ten thousand publications is offered as further evidence of scalability and practicality.

📝 Abstract

Research artifacts are distributed primarily as reader-oriented documents like PDFs. This creates a bottleneck for increasingly agent-assisted and agent-native research workflows, in which LLM agents need to infer fine-grained, task-relevant information from lengthy full documents, a process that is expensive, repetitive, and unstable at scale. We introduce Knows, a lightweight companion specification that binds structured claims, evidence, provenance, and verifiable relations to existing research artifacts in a form LLM agents can consume directly. Knows addresses the gap with a thin YAML sidecar (KnowsRecord) that coexists with the original PDF, requires no changes to the publication itself, and is validated by a deterministic schema linter. We evaluate Knows on 140 comprehension questions across 20 papers spanning 14 academic disciplines, comparing PDF-only, sidecar-only, and hybrid conditions across six LLM agents of varying capacity. Weak models (0.8B–2B parameters) improve from 19–25% to 47–67% accuracy (+29 to +42 percentage points) when reading the sidecar instead of the PDF, while consuming 29–86% fewer input tokens; an LLM-as-judge re-scoring confirms that weak-model sidecar accuracy (75–77%) approaches stronger-model PDF accuracy (78–83%). Beyond this controlled evaluation, a community sidecar hub at https://knows.academy/ has already indexed over ten thousand publications and continues to grow daily, providing independent evidence that the format is adoption-ready at scale.
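
To make the idea of a sidecar checked by a deterministic linter more concrete, the following is a minimal sketch in Python. It is not the Knows schema or the authors' linter: the field names (`artifact`, `claims`, `evidence`, `source`), the example values, and the `lint` function are all hypothetical, and the record is shown as the dictionary a YAML parser would produce so that the example needs no third-party library.

```python
from typing import Any

# Hypothetical sidecar content, shown as the dict a YAML parser would return.
# Every field name and value here is an illustrative assumption, not the
# actual KnowsRecord schema.
record = {
    "artifact": "paper.pdf",
    "claims": [
        {"id": "c1", "text": "Example claim text.", "evidence": ["e1"]},
    ],
    "evidence": [
        {"id": "e1", "source": "section-3"},
    ],
}


def lint(record: dict[str, Any]) -> list[str]:
    """Return a sorted list of problems; an empty list means the record passes.

    The checks use no model calls and no randomness, so the same input
    always yields the same output.
    """
    problems = []

    # Structural check: the assumed top-level fields must be present.
    for key in ("artifact", "claims", "evidence"):
        if key not in record:
            problems.append(f"missing top-level field: {key}")

    evidence = record.get("evidence", [])
    evidence_ids = {item.get("id") for item in evidence}

    # Provenance check: every evidence item must say where it comes from.
    for item in evidence:
        if not item.get("source"):
            problems.append(f"evidence {item.get('id')} has no source")

    # Relation check: every claim must cite evidence that actually exists.
    for claim in record.get("claims", []):
        refs = claim.get("evidence", [])
        if not refs:
            problems.append(f"claim {claim.get('id')} cites no evidence")
        for ref in refs:
            if ref not in evidence_ids:
                problems.append(
                    f"claim {claim.get('id')} cites unknown evidence: {ref}"
                )

    return sorted(problems)
```

Calling `lint(record)` on the record above returns an empty list. A record whose claim cites an evidence id that is not defined would instead return a message naming the dangling reference, which is the kind of verifiable relation a purely rule-based check can enforce.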

Problem

Research questions and friction points this paper is trying to address.

research representation

LLM agents

structured data

scientific documents

agent-native workflows

Innovation

Methods, ideas, or system contributions that make the work stand out.

structured research representation

LLM agent

sidecar format