HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing pathology vision-language models struggle to generate comprehensive structured reports covering diagnostic conclusions, histological grades, and ancillary test results. This work proposes HiPath, a lightweight framework that keeps the UNI2 and Qwen3 backbones frozen and adds three trainable modules: a Hierarchical Patch Aggregator (HiPA), Hierarchical Contrastive Learning based on optimal transport (HiCL), and Slot-based Masked Diagnosis Prediction (Slot-MDP). Together they provide hierarchical vision-language alignment and structured report prediction from multi-image inputs. Using a large-scale Chinese pathology dataset of 749K real-world cases from three hospitals, HiPath achieves a strict accuracy of 68.9%, a clinically acceptable accuracy of 74.7%, and a safety rate of 97.3%. In cross-hospital evaluation, strict accuracy drops by only 3.4 percentage points, indicating that the model generalises across sites.

📝 Abstract
Pathology reports are structured, multi-granular documents encoding diagnostic conclusions, histological grades, and ancillary test results across one or more anatomical sites; yet existing pathology vision-language models (VLMs) reduce this output to a flat label or free-form text. We present HiPath, a lightweight VLM framework built on frozen UNI2 and Qwen3 backbones that treats structured report prediction as its primary training objective. Three trainable modules totalling 15M parameters address complementary aspects of the problem: a Hierarchical Patch Aggregator (HiPA) for multi-image visual encoding, Hierarchical Contrastive Learning (HiCL) for cross-modal alignment via optimal transport, and Slot-based Masked Diagnosis Prediction (Slot-MDP) for structured diagnosis generation. Trained on 749K real-world Chinese pathology cases from three hospitals, HiPath achieves 68.9% strict and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines under the same frozen backbone. Cross-hospital evaluation confirms generalisation with only a 3.4pp drop in strict accuracy while maintaining 97.1% safety.
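The abstract states that HiCL aligns the two modalities via optimal transport but does not give the formulation. As a rough illustration only, the sketch below shows one common way to compute an entropy-regularised transport plan between a set of visual embeddings and a set of text embeddings with Sinkhorn iterations. The cosine cost, uniform marginals, the `eps` value, and all function names are assumptions made for this example, not details taken from the paper.

```python
import math


def cosine_cost(xs, ys):
    """Cost matrix C[i][j] = 1 - cosine_similarity(xs[i], ys[j])."""
    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(a * a for a in v)) or 1.0

    cost = []
    for x in xs:
        nx = norm(x)
        row = []
        for y in ys:
            dot = sum(a * b for a, b in zip(x, y))
            row.append(1.0 - dot / (nx * norm(y)))
        cost.append(row)
    return cost


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised OT plan with uniform marginals (assumed)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each visual embedding
    b = [1.0 / m] * m  # mass on each text embedding
    # Gibbs kernel: low cost -> large kernel value.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(plan, cost):
    """Total cost of moving mass according to the plan."""
    return sum(p * c for pr, cr in zip(plan, cost) for p, c in zip(pr, cr))


if __name__ == "__main__":
    # Toy embeddings: three "visual" vectors and two "text" vectors.
    visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text = [[1.0, 0.0], [0.0, 1.0]]
    C = cosine_cost(visual, text)
    P = sinkhorn(C)
    for row in P:
        print([round(p, 4) for p in row])
    print("transport cost:", round(transport_cost(P, C), 4))
```

In a contrastive setting, a transport cost of this kind could serve as a set-to-set distance between an image group and a report, with lower cost for matching pairs. Whether HiCL uses it this way is not stated in the abstract.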
Problem

Research questions and friction points this paper is trying to address.

structured pathology report
vision-language model
multi-granular diagnosis
hierarchical alignment
pathology report prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Vision-Language Alignment
Structured Pathology Report Generation
Frozen Backbone Adaptation
Optimal Transport-based Contrastive Learning
Slot-based Diagnosis Prediction
Ruicheng Yuan
College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
Zhenxuan Zhang
Georgia Institute of Technology
Anbang Wang
Taiyuan University of Technology
nonlinear laser dynamics, broadband chaos generation, chaos OTDR
Liwei Hu
Department of Bioengineering and Imperial-X, Imperial College London, London, UK
Xiangqian Hua
Department of Pathology, Xiangtan Maternal and Child Health Hospital, Xiangtan, Hunan, China
Yaya Peng
Department of Pathology, The First People’s Hospital of Xiangtan City, Xiangtan, Hunan, China
Jiawei Luo
Professor of Computer Science, Hunan University
bioinformatics, data mining
Guang Yang
Department of Bioengineering and Imperial-X, Imperial College London, London, UK