PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus

📅 2025-05-23
🤖 AI Summary
This study addresses the scarcity of large-scale, temporally annotated clinical narratives. It introduces PMOA-TTS, the first openly available corpus of its kind, built from 124,699 PubMed Open Access case reports and comprising over 5.6 million timestamped clinical events organized into structured patient timelines. The pipeline first identifies single-patient case reports by combining heuristic filtering with Llama 3.3, then produces (event, time) pairs through prompt-driven extraction with Llama 3.3 and DeepSeek R1. Timeline quality is evaluated against a clinician-curated reference set using three metrics: event-level matching, temporal concordance (c-index), and Area Under the Log-Time CDF (AULTC) for timestamp alignment. The extracted timelines reach an 80% event match rate at a cosine similarity threshold of 0.1 and a temporal concordance c-index above 0.90. In a downstream survival prediction task, embeddings of the extracted timelines achieve time-dependent concordance indices up to 0.82 ± 0.01, which indicates that the temporally structured narratives carry predictive value.

📝 Abstract
Understanding temporal dynamics in clinical narratives is essential for modeling patient trajectories, yet large-scale temporally annotated resources remain limited. We present PMOA-TTS, the first openly available dataset of 124,699 PubMed Open Access (PMOA) case reports, each converted into structured (event, time) timelines via a scalable LLM-based pipeline. Our approach combines heuristic filtering with Llama 3.3 to identify single-patient case reports, followed by prompt-driven extraction using Llama 3.3 and DeepSeek R1, resulting in over 5.6 million timestamped clinical events. To assess timeline quality, we evaluate against a clinician-curated reference set using three metrics: (i) event-level matching (80% match at a cosine similarity threshold of 0.1), (ii) temporal concordance (c-index > 0.90), and (iii) Area Under the Log-Time CDF (AULTC) for timestamp alignment. Corpus-level analysis shows wide diagnostic and demographic coverage. In a downstream survival prediction task, embeddings from extracted timelines achieve time-dependent concordance indices up to 0.82 $\pm$ 0.01, demonstrating the predictive value of temporally structured narratives. PMOA-TTS provides a scalable foundation for timeline extraction, temporal reasoning, and longitudinal modeling in biomedical NLP. The dataset is available at: https://huggingface.co/datasets/snoroozi/pmoa-tts .
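To make the first two evaluation metrics concrete, here is a minimal Python sketch of event-level matching followed by temporal concordance on the matched pairs. It is an illustration under stated assumptions, not the authors' implementation: `bow_cosine` uses bag-of-words counts as a stand-in for whatever text embedding the paper uses, the greedy best-match rule in `match_events` is hypothetical, and `concordance` is a generic pairwise c-index whose tie handling may differ from the paper's. The example events and times are invented.

```python
import math
from collections import Counter
from itertools import combinations


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words counts (assumed stand-in for a real embedding)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def match_events(predicted, reference, threshold=0.1):
    """Pair each reference event with its most similar predicted event.

    A pair is kept only if the similarity reaches the threshold.
    Returns a list of (reference_time, predicted_time) tuples.
    Note: this greedy rule does not enforce one-to-one matching.
    """
    pairs = []
    for ref_event, ref_time in reference:
        best = max(predicted, key=lambda p: bow_cosine(ref_event, p[0]), default=None)
        if best is not None and bow_cosine(ref_event, best[0]) >= threshold:
            pairs.append((ref_time, best[1]))
    return pairs


def concordance(pairs):
    """Share of pairs with distinct reference times whose order is preserved.

    Ties in predicted time count as 0.5 (an assumption of this sketch).
    """
    concordant, comparable = 0.0, 0
    for (r1, p1), (r2, p2) in combinations(pairs, 2):
        if r1 == r2:
            continue
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (r1 < r2) == (p1 < p2):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Invented example: times are hours relative to admission.
reference = [
    ("fever", -72),
    ("admitted to hospital", 0),
    ("started antibiotics", 24),
    ("discharged home", 168),
]
predicted = [
    ("patient admitted to hospital", 0),
    ("fever", -48),
    ("antibiotics started", 12),
    ("rash", 36),
]

pairs = match_events(predicted, reference)
match_rate = len(pairs) / len(reference)  # 0.75: "discharged home" has no match
c_index = concordance(pairs)              # 1.0: all matched pairs keep their order
```

In this toy case the predicted timestamps are off by up to 24 hours, yet the c-index is perfect because only the ordering is scored. That gap is what a timestamp-alignment metric such as AULTC is meant to capture.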
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale temporally annotated clinical narrative datasets
Need for scalable methods to extract structured timelines from case reports
Limited resources for temporal reasoning in biomedical NLP applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based pipeline converts case reports into structured (event, time) timelines
Heuristic filtering combined with Llama 3.3 to identify single-patient case reports
Prompt-driven extraction using Llama 3.3 and DeepSeek R1
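
As a rough illustration of what a structured (event, time) timeline could look like once a model's output is parsed, the Python sketch below reads pipe-delimited lines into typed records. The `event | hours` line format, the `TimedEvent` class, and the `parse_timeline` helper are all hypothetical. The summary above does not specify the output format of the pipeline or the schema of the released dataset.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedEvent:
    event: str
    hours: float  # assumed unit: hours relative to a reference point such as admission


def parse_timeline(raw: str) -> list[TimedEvent]:
    """Parse 'event | hours' lines, skipping any line that does not fit that shape."""
    timeline = []
    for line in raw.splitlines():
        # Split on the last pipe so event text may itself contain a pipe.
        event, sep, time_text = line.rpartition("|")
        if not sep or not event.strip():
            continue  # no delimiter or empty event: likely model chatter
        try:
            hours = float(time_text.strip())
        except ValueError:
            continue  # non-numeric time such as "soon"
        if not math.isfinite(hours):
            continue  # reject "nan" and "inf"
        timeline.append(TimedEvent(event.strip(), hours))
    # Order chronologically; sorted() is stable, so ties keep input order.
    return sorted(timeline, key=lambda e: e.hours)


# Invented model output with one chatter line and one malformed time.
raw_output = """Here is the timeline:
admitted to hospital | 0
fever | -72
discharged home | 168
follow-up visit | soon"""

timeline = parse_timeline(raw_output)
```

Skipping malformed lines instead of raising is a design choice suited to large batch runs, where a few unparseable lines should not discard an entire case report.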