PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset

📅 2025-05-30
🤖 AI Summary
Precision oncology for lung cancer urgently requires low-cost, non-invasive methods to predict genomic alterations. Existing AI models applied to histopathological whole-slide images (WSIs) perform poorly at identifying mutation subtypes, localizing exon positions, and estimating tumor mutational burden (TMB), largely because their molecular annotations are coarse-grained. Method: We construct the first multi-center, pathology-molecular paired dataset for lung cancer (2,024 cases: 1,576 from the Second Xiangya Hospital and 448 from TCGA-LUAD), systematically annotated with fine-grained molecular labels, including driver gene mutations, mutation subtypes, exact exon coordinates, and TMB status. We propose a multi-task framework based on multi-instance learning, integrating WSI tile embeddings, attention-based aggregation, and joint optimization. Results: Our model achieves an AUC of 0.89 for mutation prediction, a median exon localization error of <2 exons, and 82.3% accuracy for TMB classification, establishing a clinically translatable paradigm for non-invasive genomic profiling.

📝 Abstract
Accurately predicting gene mutations, mutation subtypes, and the affected exons in lung cancer is critical for personalized treatment planning and prognostic assessment. Given regional disparities in medical resources and the high cost of genomic assays, using artificial intelligence to infer these mutations and exon variants from routine histopathology images could greatly facilitate precision therapy. Although prior studies have shown that deep learning can accelerate the prediction of key gene mutations from lung cancer pathology slides, their performance remains suboptimal and has so far been limited mainly to early screening tasks. To address these limitations, we assembled PathGene, which comprises histopathology images paired with next-generation sequencing reports from 1,576 patients at the Second Xiangya Hospital, Central South University, and 448 TCGA-LUAD patients. This multi-center dataset links whole-slide images to driver gene mutation status, mutation subtypes, exons, and tumor mutational burden (TMB) status, with the goal of using pathology images to predict mutations, subtypes, exon locations, and TMB for early genetic screening and to advance precision oncology. Unlike existing datasets, PathGene provides molecular-level information linked to the histopathology images to facilitate the development of biomarker prediction models. We benchmarked 11 multiple-instance learning methods on PathGene for the mutation, subtype, exon, and TMB prediction tasks. These benchmarked methods offer valuable options for early genetic screening of lung cancer patients and can help clinicians quickly develop personalized, targeted treatment plans. Code and data are available at https://github.com/panliangrui/NIPS2025/.
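For readers unfamiliar with multiple-instance learning on whole-slide images, the following is a minimal sketch of one common variant, attention-based pooling: a slide is treated as a bag of tile embeddings, each tile receives a learned attention weight, and the weighted sum is passed to a classifier. It is an illustration of the general technique only, not the PathGene code or any of the 11 benchmarked methods. The function names, dimensions, and randomly drawn weights are all assumptions chosen to keep the example self-contained.

```python
import math
import random


def attention_mil_pool(tiles, V, w):
    """Pool a bag of tile embeddings into one slide-level embedding.

    tiles: list of N tile embeddings, each a list of D floats
    V:     H x D projection matrix (hypothetical learned weights)
    w:     attention vector of H floats (hypothetical learned weights)
    """
    # Unnormalized attention score per tile: w . tanh(V h)
    scores = []
    for h in tiles:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(a * z for a, z in zip(w, hidden)))

    # Softmax over tiles, shifted by the max score for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Slide embedding is the attention-weighted sum of tile embeddings
    dim = len(tiles[0])
    slide = [sum(a * h[d] for a, h in zip(weights, tiles)) for d in range(dim)]
    return slide, weights


def predict_probability(slide, beta, bias):
    """Logistic head on the slide embedding (e.g. mutated vs. wild type)."""
    logit = sum(b * x for b, x in zip(beta, slide)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy data: 5 tiles with 8-dimensional embeddings, untrained random weights
rng = random.Random(0)
D, H, N = 8, 4, 5
tiles = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
V = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
w = [rng.gauss(0, 0.5) for _ in range(H)]
beta = [rng.gauss(0, 0.5) for _ in range(D)]

slide, weights = attention_mil_pool(tiles, V, w)
prob = predict_probability(slide, beta, 0.0)
print("attention weights:", [round(a, 3) for a in weights])
print("slide-level probability:", round(prob, 3))
```

Because the pooling is a weighted sum, the slide embedding does not depend on tile order, and the attention weights can be inspected to see which tiles contributed most to a prediction. In practice the weights would be trained end to end with a deep learning framework; the random values here only exercise the arithmetic.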
Problem

Research questions and friction points this paper is trying to address.

Predicting lung cancer gene mutations from histopathology images
Improving AI-based exon and mutation subtype prediction
Enabling precision oncology with multi-center biomarker datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multicenter dataset links slides to mutations
AI predicts mutations from pathology images
Benchmarked 11 learning methods for screening
Authors
Liangrui Pan
Hunan University
Qingchun Liang
The Second Xiangya Hospital
Shen Zhao
Sun Yat-sen University
Songqing Fan
The Second Xiangya Hospital
Shaoliang Peng
Cheung Kong Professor, Hunan University
High Performance Computing, Big Data, Bioinformatics, AI