AutoPK: Leveraging LLMs and a Hybrid Similarity Metric for Advanced Retrieval of Pharmacokinetic Data from Complex Tables and Documents

2025-09-26

AI Summary
Pharmacokinetic (PK) tables have complex structures and heterogeneous terminology, which severely impedes automated data extraction and standardization. To address this, the authors propose AutoPK, a two-stage framework. Stage I identifies variant expressions of PK parameters using large language models (LLMs), a hybrid similarity metric, and LLM-based validation. Stage II filters the relevant rows, converts the table into key-value text, and uses an LLM to reconstruct a standardized table. The key innovations are the hybrid similarity metric and a lightweight verification feedback loop, which substantially reduce LLM hallucination. Evaluated on 605 real-world PK tables, AutoPK achieves F1-scores of 0.92 for half-life and 0.91 for clearance using LLaMA-3.1-70B. With the smaller Gemma-3-27B, it improves F1 by 2–7× over direct use of the model and reduces hallucination rates from 60–95% to 8–14%, outperforming commercial systems such as GPT-4o Mini on several PK parameters.

Abstract
Pharmacokinetics (PK) plays a critical role in drug development and regulatory decision-making for human and veterinary medicine, directly affecting public health through drug safety and efficacy assessments. However, PK data are often embedded in complex, heterogeneous tables with variable structures and inconsistent terminologies, posing significant challenges for automated PK data retrieval and standardization. We introduce AutoPK, a novel two-stage framework for accurate and scalable extraction of PK data from complex scientific tables. In the first stage, AutoPK identifies and extracts PK parameter variants using large language models (LLMs), a hybrid similarity metric, and LLM-based validation. The second stage filters relevant rows, converts the table into a key-value text format, and uses an LLM to reconstruct a standardized table. Evaluated on a real-world dataset of 605 PK tables, including captions and footnotes, AutoPK shows significant improvements in precision and recall over direct LLM baselines. For instance, AutoPK with LLaMA 3.1-70B achieved an F1-score of 0.92 on half-life and 0.91 on clearance parameters, outperforming direct use of LLaMA 3.1-70B by margins of 0.10 and 0.21, respectively. Smaller models such as Gemma 3-27B and Phi 3-12B with AutoPK achieved 2- to 7-fold F1 gains over their direct use, with Gemma's hallucination rates reduced from 60-95% to 8-14%. Notably, AutoPK enabled open-source models like Gemma 3-27B to outperform commercial systems such as GPT-4o Mini on several PK parameters. AutoPK enables scalable and high-confidence PK data extraction, making it well-suited for critical applications in veterinary pharmacology, drug safety monitoring, and public health decision-making, while addressing heterogeneous table structures and terminology and demonstrating generalizability across key PK parameters. Code and data: https://github.com/hosseinsholehrasa/AutoPK
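
The second stage described in the abstract (filter relevant rows, then flatten the table into key-value text before handing it to an LLM) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `filter_rows` and `rows_to_key_value`, the `column: value` serialization, the substring-based row filter, and the toy table values are hypothetical and are not taken from the AutoPK repository.

```python
from typing import List


def filter_rows(rows: List[List[str]], variants: List[str]) -> List[List[str]]:
    """Keep rows whose first cell mentions any known parameter variant.

    Assumption: a simple case-insensitive substring test stands in for
    whatever row filtering the authors actually use.
    """
    lowered = [v.lower() for v in variants]
    return [row for row in rows if any(v in row[0].lower() for v in lowered)]


def rows_to_key_value(header: List[str], rows: List[List[str]]) -> str:
    """Serialize each row as 'column: value' pairs, one row per line."""
    lines = []
    for row in rows:
        pairs = [f"{col}: {cell}" for col, cell in zip(header, row)]
        lines.append("; ".join(pairs))
    return "\n".join(lines)


# Toy table with made-up values, used only to show the transformation.
header = ["Parameter", "Units", "IV", "Oral"]
rows = [
    ["t1/2 (elimination)", "h", "4.2", "5.1"],
    ["Cmax", "ug/mL", "12.0", "7.3"],
    ["Half-life, terminal", "h", "4.5", "5.4"],
]
variants = ["t1/2", "half-life"]  # hypothetical output of the first stage

kept = filter_rows(rows, variants)
text = rows_to_key_value(header, kept)
print(text)
```

Flattening each row into explicit `column: value` pairs means the model no longer has to infer which header a cell belongs to, which is one plausible reason such a format helps with heterogeneous layouts.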
Problem

Research questions and friction points this paper is trying to address.

Extracting pharmacokinetic data from complex heterogeneous tables with variable structures
Addressing inconsistent terminologies in automated PK data retrieval and standardization
Improving precision and recall for PK parameter extraction from scientific documents
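
For reference, the precision, recall, and F1 figures quoted for this work follow the standard definitions, sketched below. The counts in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the paper, and the paper's own matching criteria for a "correct" extraction are not specified here.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct extractions, 10 spurious, 10 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=10)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```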
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs and hybrid similarity for parameter extraction
Converts tables to key-value format for standardization
Validates parameter variants with an LLM, filters relevant rows, and reconstructs a standardized table
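
The idea of scoring candidate parameter names with a hybrid similarity can be sketched as follows. The text above does not say which components the AutoPK metric combines, so this is an illustrative stand-in rather than the authors' metric: it blends a character-level ratio from Python's `difflib.SequenceMatcher` with a token-level Jaccard overlap. The names `hybrid_similarity` and `candidate_variants`, the equal weighting, and the 0.5 threshold are all assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a label."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def hybrid_similarity(a: str, b: str, weight: float = 0.5) -> float:
    """Weighted blend of the two scores (the weight is an assumed default)."""
    return weight * char_similarity(a, b) + (1.0 - weight) * token_jaccard(a, b)


def candidate_variants(
    target: str, labels: List[str], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Return labels scoring at or above the threshold, best first."""
    scored = [(label, hybrid_similarity(target, label)) for label in labels]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


labels = ["Half life (h)", "Cmax", "Clearance (L/h/kg)"]
matches = candidate_variants("half-life", labels)
print(matches)
```

A purely lexical score like this cannot link abbreviations such as "t1/2" to "half-life", which is why a pipeline of this kind would still pass candidates to an LLM for validation, as the framework describes.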
Authors

Hossein Sholehrasa
1DATA Consortium and FARAD Program, Kansas State University, Olathe, KS, USA
Amirhossein Ghanaatian
Department of Computer Science, Kansas State University, Manhattan, KS, USA
Doina Caragea
Kansas State University
deep learning, text mining, data mining, data science
Lisa A. Tell
FARAD, Department of Medicine and Epidemiology, University of California-Davis, Davis, CA, USA
Jim E. Riviere
1DATA Consortium and FARAD Program, Kansas State University, Olathe, KS, USA
Majid Jaberi-Douraki
Kansas State University
Mathematical Biology, Big Data, Data Science, One Health, 1DATA