🤖 AI Summary
This work addresses the lack of video-modality modeling in computational pathology by introducing the first video multimodal large language model for pathological diagnosis. Methodologically, it integrates single patch images, automatically extracted keyframe clips, and manually segmented pathology videos to emulate the clinician’s natural workflow: “viewing slides → describing findings → reasoning → diagnosing.” It pioneers temporal video modeling in pathology analysis, constructs the first video chain-of-thought instruction dataset (VideoPath-Instruct), and proposes a two-stage transfer-learning paradigm to mitigate annotation scarcity. Built upon LLaVA, the model extends the visual encoder with CLIP-ViP features, temporal attention mechanisms, and instruction tuning to enable multi-granularity perception and generative diagnostic reasoning. The approach establishes a new benchmark for pathological video diagnosis, achieving notable improvements in both accuracy and interpretability. All code, data, and models are publicly released.
📝 Abstract
We present VideoPath-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios: single patch images, automatically keyframe-extracted clips, and manually segmented pathology videos, mimicking the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the VideoPath-Instruct dataset, comprising 4,278 pairs of videos and diagnosis-specific chain-of-thought instructions sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and yields only limited volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, and then fine-tune on manually segmented videos. VideoPath-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at https://github.com/trinhvg/VideoPath-LLaVA.