🤖 AI Summary
Current WSI-based survival prediction methods suffer from two major limitations: coarse-grained vision–language alignment and neglect of the inherent hierarchical tissue structure. To address these, we propose a hierarchical vision–language collaboration framework. First, we construct multi-attribute textual prompts and integrate optimal prompt learning to achieve fine-grained, multi-scale vision–text feature alignment. Second, we design a cross-level propagation module and a mutual contrastive learning mechanism to explicitly model patch–region interactions and strengthen hierarchical representation learning. Leveraging a pre-trained feature extractor, our method obtains multi-level visual representations. Evaluated on three TCGA cancer cohorts, it achieves state-of-the-art performance in survival prediction, significantly improving both predictive accuracy and generalizability, while effectively alleviating representation bottlenecks arising from label sparsity and insufficient hierarchical modeling.
📝 Abstract
Survival prediction using whole-slide images (WSIs) is crucial in cancer research. Despite notable success, existing approaches are limited by their reliance on sparse slide-level labels, which hinders the learning of discriminative representations from gigapixel WSIs. Recently, vision-language (VL) models, which incorporate additional language supervision, have emerged as a promising solution. However, VL-based survival prediction remains largely unexplored due to two key challenges. First, current methods often rely on a single simple language prompt and basic cosine similarity, which fails to capture fine-grained associations between multi-faceted linguistic information and visual features within a WSI, resulting in inadequate vision-language alignment. Second, these methods primarily exploit patch-level information, overlooking the intrinsic hierarchy of WSIs and the interactions between its levels, leading to ineffective modeling of hierarchical interactions. To tackle these problems, we propose a novel Hierarchical vision-Language collaboration (HiLa) framework for improved survival prediction. Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from WSIs at both patch and region levels. At each level, a series of language prompts describing various survival-related attributes is constructed and aligned with the visual features via Optimal Prompt Learning (OPL). This approach enables comprehensive learning of discriminative visual features corresponding to different survival-related attributes from prompts, thereby improving vision-language alignment. Furthermore, we introduce two modules, i.e., Cross-Level Propagation (CLP) and Mutual Contrastive Learning (MCL), to maximize hierarchical cooperation by promoting interactions and consistency between patch and region levels. Experiments on three TCGA datasets demonstrate the state-of-the-art performance of HiLa.
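To make the two ideas concrete, here is a minimal NumPy sketch of (1) pooling each level's visual features against several survival-attribute prompts and (2) a symmetric contrastive loss tying the patch- and region-level summaries together. This is an illustrative simplification, not the paper's implementation: OPL's optimal prompt learning is approximated here with plain cosine-similarity attention, and all dimensions, prompt counts, and function names are assumptions for the sketch.

```python
# Hedged sketch of hierarchical prompt alignment + mutual contrastive learning.
# NOT the paper's exact method: OPL is replaced by simple cosine attention,
# and all shapes/names below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x, axis=-1):
    # Normalize rows to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def prompt_align(visual, prompts):
    """Pool visual tokens into one vector per survival-related prompt.

    visual : (N, d) features at one level (patches or regions)
    prompts: (K, d) text embeddings of K attribute prompts
    returns: (K, d) prompt-conditioned visual summaries
    """
    sim = l2norm(visual) @ l2norm(prompts).T                 # (N, K) cosine similarity
    attn = np.exp(sim) / np.exp(sim).sum(0, keepdims=True)   # softmax over tokens
    return attn.T @ visual                                   # (K, d)

def mutual_contrastive(a, b, tau=0.1):
    """Symmetric InfoNCE over K matched prompt summaries from two levels."""
    logits = (l2norm(a) @ l2norm(b).T) / tau                 # (K, K)
    labels = np.arange(len(a))                               # diagonal = positives
    def ce(lg):
        lg = lg - lg.max(1, keepdims=True)                   # numerical stability
        p = np.exp(lg) / np.exp(lg).sum(1, keepdims=True)
        return -np.log(p[labels, labels] + 1e-12).mean()
    return 0.5 * (ce(logits) + ce(logits.T))

d, K = 64, 4
patch_feats  = rng.standard_normal((200, d))   # patch-level WSI features
region_feats = rng.standard_normal((16, d))    # region-level WSI features
prompts      = rng.standard_normal((K, d))     # K survival-attribute prompts

patch_sum  = prompt_align(patch_feats, prompts)    # (K, d)
region_sum = prompt_align(region_feats, prompts)   # (K, d)
loss = mutual_contrastive(patch_sum, region_sum)   # scalar MCL-style loss
```

In this toy setup each of the K prompt-conditioned summaries from the patch level is treated as a positive pair with its counterpart at the region level, which mirrors the consistency objective MCL describes; the paper's actual alignment and propagation modules are richer than this sketch.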