🤖 AI Summary
This study addresses the challenge of fragmented pharmacokinetic (PK) data scattered across structurally heterogeneous tables in scientific literature, which hinders efficient and accurate manual extraction. To overcome this, the authors propose a human-centric table understanding approach that integrates natural language processing, computer vision, and structured parsing techniques to develop a specialized AI model. This model automatically interprets complex table layouts in XML-formatted publications, accurately aligns semantic meanings of row and column headers, and extracts PK parameters with high precision. The method significantly enhances both extraction accuracy and scalability, enabling large-scale, automated acquisition of PK data and laying the foundation for a dynamic, continuously updated PK knowledge base.
📝 Abstract
Pharmacology lacks centralized, comprehensive, and up-to-date repositories of pharmacokinetic (PK) data. This poses a significant challenge for R&D, because collecting all the required quantitative PK parameters from diverse scientific publications is time-consuming and difficult. Quantitative PK information is predominantly presented in tables, mostly available as XML, HTML, or PDF files in online repositories and scientific publications, including supplementary materials. Tables are therefore among the most important information elements of scientific or regulatory documents. Extracting data from tables manually is labor-intensive, while automated machine learning models may struggle to detect and extract the relevant data accurately because tabular layouts are complex and diverse. The difficulty of information extraction and reading-order detection depends largely on the structural complexity of the tables. Efforts to understand tables should therefore prioritize capturing the content of table cells in a way that matches how a human reader naturally comprehends the information. The Food Animal Residue Avoidance Databank (FARAD) has been manually extracting tabular data and other information from the literature and from regulatory agencies for over 40 years. However, the large volume of publications released daily now makes automating this process urgent. Maintaining accuracy has become increasingly difficult, as manual extraction is tedious and error-prone, especially given the staffing shortages we currently face. This necessitates AI algorithms for table detection and extraction that precisely handle cells according to the table structure, as indicated by column and/or row header information.
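To make the idea of reading cells according to their row and column headers concrete, the following is a minimal sketch, not the authors' model. It assumes a simple XML table with one header row in `thead` and the row header in the first cell of each body row, and it ignores merged cells (`rowspan`/`colspan`), which real publication tables often contain. The sample table, its values, and the function name `extract_cells` are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical table in a simplified XML layout; the numbers are
# placeholders for illustration, not data from any publication.
SAMPLE_XML = """
<table>
  <thead>
    <tr><th>Parameter</th><th>Group A</th><th>Group B</th></tr>
  </thead>
  <tbody>
    <tr><td>Cmax (ug/mL)</td><td>1.2</td><td>2.4</td></tr>
    <tr><td>T1/2 (h)</td><td>3.5</td><td>4.1</td></tr>
  </tbody>
</table>
"""


def cell_text(cell):
    """Return the full text of a cell, including text inside nested tags."""
    return "".join(cell.itertext()).strip()


def extract_cells(xml_text):
    """Pair every data cell with its row header and column header.

    Assumes one header row and one row-header column, with no merged cells.
    """
    root = ET.fromstring(xml_text.strip())
    header_row = root.find("thead/tr")
    # Skip the first header cell: it labels the row-header column itself.
    col_headers = [cell_text(c) for c in header_row][1:]

    records = []
    for row in root.findall("tbody/tr"):
        cells = [cell_text(c) for c in row]
        row_header, values = cells[0], cells[1:]
        for col_header, value in zip(col_headers, values):
            records.append(
                {"row": row_header, "column": col_header, "value": value}
            )
    return records


if __name__ == "__main__":
    for record in extract_cells(SAMPLE_XML):
        print(record)
```

Each record carries the two headers that give a value its meaning, which is the reading a human applies when scanning a table. Handling spanning headers, nested header levels, and footnote markers would need additional logic that this sketch deliberately leaves out.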