Hepato-LLaVA: An Expert MLLM with Sparse Topo-Pack Attention for Hepatocellular Pathology Analysis on Whole Slide Images

2026-02-22
Citations: 0
Influential: 0
AI Summary
Existing methods for analyzing whole-slide images of hepatocellular carcinoma rely on fixed-resolution processing and inefficient feature aggregation, which leads to information loss or feature redundancy. To overcome these limitations, the authors propose a multimodal large language model tailored for hepatocellular pathology. Its sparse Topo-Pack attention mechanism models two-dimensional tissue topology and aggregates local diagnostic evidence into global semantic summaries. The study also introduces HepatoPathoVQA, an expert-validated dataset of 33,000 multi-level pathological visual question-answering pairs, and integrates multi-scale image–text alignment techniques. The authors report state-of-the-art performance on hepatocellular carcinoma diagnosis and image captioning tasks.

Abstract
Hepatocellular Carcinoma diagnosis relies heavily on the interpretation of gigapixel Whole Slide Images. However, current computational approaches are constrained by fixed-resolution processing mechanisms and inefficient feature aggregation, which inevitably lead to either severe information loss or high feature redundancy. To address these challenges, we propose Hepato-LLaVA, a specialized Multi-modal Large Language Model designed for fine-grained hepatocellular pathology analysis. We introduce a novel Sparse Topo-Pack Attention mechanism that explicitly models 2D tissue topology. This mechanism effectively aggregates local diagnostic evidence into semantic summary tokens while preserving global context. Furthermore, to overcome the lack of multi-scale data, we present HepatoPathoVQA, a clinically grounded dataset comprising 33K hierarchically structured question-answer pairs validated by expert pathologists. Our experiments demonstrate that Hepato-LLaVA achieves state-of-the-art performance on HCC diagnosis and captioning tasks, significantly outperforming existing methods. Our code and implementation details are available at https://pris-cv.github.io/Hepto-LLaVA/.
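
The abstract does not spell out how Sparse Topo-Pack Attention is built, so the following is only a minimal sketch of one way a topology-aware sparse attention mask could look. Everything in it is an assumption made for illustration: patches are grouped into square `pack` x `pack` windows on the 2D grid, each window gets one summary token, patches attend only within their window and to their own summary token, and summary tokens attend to their window and to each other. The function name `build_pack_mask` and the token ordering are hypothetical and are not taken from the paper or its code.

```python
def build_pack_mask(height, width, pack):
    """Return (mask, pack_of) for a height x width grid of patch tokens.

    Tokens 0 .. height*width-1 are patches in row-major order; one summary
    token per pack follows. mask[q][k] is True when query token q may
    attend to key token k. (Illustrative layout, not the authors' design.)
    """
    packs_per_row = -(-width // pack)   # ceiling division
    packs_per_col = -(-height // pack)
    n_patches = height * width
    n_packs = packs_per_row * packs_per_col
    n_tokens = n_patches + n_packs

    # Pack index of each patch, derived from its 2D grid position.
    pack_of = [(r // pack) * packs_per_row + (c // pack)
               for r in range(height) for c in range(width)]

    mask = [[False] * n_tokens for _ in range(n_tokens)]

    # Patch queries: local evidence stays inside the pack,
    # plus a link to the pack's own summary token.
    for q in range(n_patches):
        for k in range(n_patches):
            if pack_of[q] == pack_of[k]:
                mask[q][k] = True
        mask[q][n_patches + pack_of[q]] = True

    # Summary queries: read their own pack, and see every summary token
    # so that global context is shared across packs.
    for s in range(n_packs):
        q = n_patches + s
        for k in range(n_patches):
            if pack_of[k] == s:
                mask[q][k] = True
        for t in range(n_packs):
            mask[q][n_patches + t] = True

    return mask, pack_of


mask, pack_of = build_pack_mask(height=4, width=4, pack=2)
allowed = sum(row.count(True) for row in mask)
print(f"{allowed} of {len(mask) ** 2} query-key pairs are allowed")
```

Under these assumptions, a 4 x 4 grid with 2 x 2 packs allows 112 of the 400 possible query-key pairs, which shows how restricting attention to packs and summary tokens reduces the pairs computed compared with dense attention.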
Problem

Research questions and friction points this paper is trying to address.

Hepatocellular Carcinoma
Whole Slide Images
feature redundancy
information loss
multi-scale data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Topo-Pack Attention
Multi-modal Large Language Model
Whole Slide Image
Hepatocellular Carcinoma
Topological Feature Aggregation
Yuxuan Yang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Zhonghao Yan
Beijing University of Posts and Telecommunications
Vision Language Model, Agent, Generative AI, Medical Image Analysis
Yi Zhang
Department of Pathology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
Bo Yun
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Muxi Diao
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Guowei Zhao
Department of Pathology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
Kongming Liang
Beijing University of Posts and Telecommunications
Computer Vision, Pattern Recognition, Machine Learning
Wenbin Li
Department of Magnetic Resonance Imaging, The First Affiliated Hospital of Zhengzhou University
MRI, Bipolar Disorder, Depression
Zhanyu Ma
Beijing University of Posts and Telecommunications
Pattern Recognition, Machine Learning, Computer Vision, Multimedia Technology, Deep Learning