How Close Are We? Limitations and Progress of AI Models in Banff Lesion Scoring

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Banff scoring of kidney transplant biopsies is semi-quantitative, governed by complex criteria, and subject to high inter-observer variability, all of which poses significant challenges for reproducible AI implementation. Method: We propose a modular, rule-driven deep learning framework that decouples key lesions (e.g., g, ptc, v) into structural and inflammatory subtasks: segmentation models extract structural features, detection models localize inflammatory foci, and both outputs are mapped to Banff scores via interpretable heuristic rules. Contribution/Results: Validated on expert-annotated data, the framework scores select lesions accurately but uncovers systematic limitations, including inconsistent intermediate representations, structural omissions, and detection ambiguity. Crucially, this work establishes computational-level standardization as a prerequisite for clinical deployment of transplant pathology AI, and it provides a methodological paradigm and benchmark for developing interpretable, clinically trustworthy AI systems in renal pathology.
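The summary's decoupling of structural and inflammatory subtasks can be sketched as a small glue step: given an instance mask from a structural segmentation model and point detections from an inflammatory-cell detector, count the foci falling inside each segmented structure. This is an illustrative sketch only; the function name, array shapes, and label conventions are assumptions, not the paper's actual interface.

```python
import numpy as np

def foci_per_structure(instance_mask: np.ndarray,
                       foci_xy: np.ndarray) -> dict[int, int]:
    """Count detected inflammatory foci inside each segmented structure.

    instance_mask: HxW integer label map (0 = background, k = structure k),
    e.g. hypothetical output of a glomerulus/PTC segmentation model.
    foci_xy: (N, 2) array of (row, col) detection centers from an
    inflammatory-cell detector.
    """
    # One counter per non-background structure label.
    counts = {int(k): 0 for k in np.unique(instance_mask) if k != 0}
    for r, c in foci_xy:
        label = int(instance_mask[int(r), int(c)])
        if label != 0:  # detections on background are ignored
            counts[label] += 1
    return counts
```

Per-structure counts like these are what heuristic rules would then translate into lesion grades (e.g., the number of inflammatory cells in the most severely involved peritubular capillary for ptc).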

📝 Abstract
The Banff Classification provides the global standard for evaluating renal transplant biopsies, yet its semi-quantitative nature, complex criteria, and inter-observer variability present significant challenges for computational replication. In this study, we explore the feasibility of approximating Banff lesion scores using existing deep learning models through a modular, rule-based framework. We decompose each Banff indicator - such as glomerulitis (g), peritubular capillaritis (ptc), and intimal arteritis (v) - into its constituent structural and inflammatory components, and assess whether current segmentation and detection tools can support their computation. Model outputs are mapped to Banff scores using heuristic rules aligned with expert guidelines, and evaluated against expert-annotated ground truths. Our findings highlight both partial successes and critical failure modes, including structural omission, hallucination, and detection ambiguity. Even when final scores match expert annotations, inconsistencies in intermediate representations often undermine interpretability. These results reveal the limitations of current AI pipelines in computationally replicating expert-level grading, and emphasize the importance of modular evaluation and a computational Banff grading standard in guiding future model development for transplant pathology.
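As a concrete example of the heuristic score mapping the abstract describes, the glomerulitis (g) lesion can be graded from per-glomerulus inflammation flags using the Banff involvement thresholds (g1 < 25% of glomeruli involved, g2 = 25-75%, g3 > 75%). This is a minimal sketch under those assumed thresholds, not the paper's actual rule implementation.

```python
def banff_g_score(glomeruli_inflamed: list[bool]) -> int:
    """Map per-glomerulus glomerulitis flags to a Banff g score (0-3).

    Each flag would come from upstream models: a segmented glomerulus
    plus detected inflammatory foci within it. Thresholds follow the
    Banff g lesion definition: g1 < 25% involvement, g2 25-75%, g3 > 75%.
    """
    if not glomeruli_inflamed:
        raise ValueError("no glomeruli detected in the biopsy")
    frac = sum(glomeruli_inflamed) / len(glomeruli_inflamed)
    if frac == 0:
        return 0
    if frac < 0.25:
        return 1
    if frac <= 0.75:
        return 2
    return 3
```

Note how brittle this mapping is to upstream errors: a single missed glomerulus (structural omission) or spurious detection shifts the fraction and can flip the grade, which is exactly the failure mode the abstract highlights.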
Problem

Research questions and friction points this paper is trying to address.

Assessing AI feasibility for Banff lesion scoring
Decomposing histological indicators into computable components
Identifying limitations in computational pathology replication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposed Banff indicators into structural and inflammatory components
Mapped model outputs using heuristic expert rules
Evaluated modular AI pipeline against expert annotations