RAPTOR+: A Visually Grounded Vision-Language Framework to Improve Clinical Trust and Auditability in Automated Cancer Referral Processing

📅 2026-05-25

🤖 AI Summary

This study addresses the processing of urgent suspected colorectal cancer referral forms, which currently depends on manual review and is poorly served by conventional OCR pipelines because of variable layouts, handwritten content, and the loss of any link between extracted values and their visual evidence. The authors propose RAPTOR+, an end-to-end vision-language approach built on a fine-tuned Qwen3-VL-8B model that extracts structured information directly from form images and localizes the supporting visual evidence, making extracted decisions easier to audit. The work also introduces a grounding-aware evaluation framework that measures both extraction accuracy and evidence localization, and applies it to 223 clinically curated referral forms. The fine-tuned model reaches 96.1% Reading Accuracy and 60.6% Strict Safety, compared with 92.6% Reading Accuracy and only 1.2% Strict Safety for zero-shot Gemini 2.5 Flash, so the main gain is in verifiable evidence grounding rather than in raw extraction.

📝 Abstract

Urgent suspected colorectal cancer (CRC) referrals create operational bottlenecks because semi-structured clinical documents often require manual review and transcription. The original RAPTOR system used Large Language Models for structured extraction but relied on a separate OCR stage, making it vulnerable to handwriting, layout variation, and loss of visual evidence linkage. We present RAPTOR+, a multimodal extension that uses Vision-Language Models (VLMs) for end-to-end referral understanding. We evaluate fine-tuned VLMs, commercial and open-source zero-shot VLMs, and the original OCR-based pipeline on 223 clinically curated CRC urgent referral forms. We also introduce a grounding-aware evaluation framework that measures both extraction accuracy and evidence localisation. Results show a clear grounding gap in zero-shot models. Gemini 2.5 Flash achieved 92.6% Reading Accuracy but only 1.2% Strict Safety. In contrast, fine-tuned Qwen3-VL-8B achieved 96.1% Reading Accuracy and 60.6% Strict Safety, substantially improving verifiable evidence grounding. These findings show that task-specific fine-tuning is essential for reliable, auditable clinical document understanding. RAPTOR+ enables extracted referral decisions to be linked to visual evidence, supporting safer and more efficient cancer referral triage.
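The abstract reports Reading Accuracy and Strict Safety but does not define them. The sketch below shows one plausible way a grounding-aware evaluation could separate the two ideas. It assumes, purely for illustration, that a field is "read" correctly when its normalized text matches the reference, and "grounded" only when it is also read correctly and its predicted bounding box overlaps the reference box with an IoU at or above a threshold. The names `FieldResult`, `reading_accuracy`, `grounded_accuracy`, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Box format assumed here: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class FieldResult:
    """One extracted form field with its (hypothetical) grounding box."""
    predicted_value: str
    true_value: str
    predicted_box: Optional[Box]  # None when the model returned no box
    true_box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def value_correct(r: FieldResult) -> bool:
    """Assumed matching rule: case-insensitive, whitespace-trimmed equality."""
    return r.predicted_value.strip().lower() == r.true_value.strip().lower()


def reading_accuracy(results: List[FieldResult]) -> float:
    """Fraction of fields whose extracted text matches the reference."""
    if not results:
        return 0.0
    return sum(value_correct(r) for r in results) / len(results)


def grounded_accuracy(results: List[FieldResult], iou_threshold: float = 0.5) -> float:
    """Fraction of fields that are both read correctly and localized.

    This is an illustrative stand-in for a strict, grounding-aware score,
    not the paper's Strict Safety definition.
    """
    if not results:
        return 0.0
    hits = 0
    for r in results:
        if r.predicted_box is None or not value_correct(r):
            continue
        if iou(r.predicted_box, r.true_box) >= iou_threshold:
            hits += 1
    return hits / len(results)


# Toy data (invented for this example, not from the paper's dataset).
example = [
    FieldResult("Yes", "yes", (0, 0, 10, 10), (0, 0, 10, 10)),    # read and grounded
    FieldResult("no", "no", (50, 50, 60, 60), (0, 0, 10, 10)),    # read, wrong region
    FieldResult("12", "21", (0, 0, 10, 10), (0, 0, 10, 10)),      # misread
    FieldResult("urgent", "urgent", None, (0, 0, 10, 10)),        # read, no box
]
print(reading_accuracy(example), grounded_accuracy(example))
```

Under a scheme like this, a model can score well on reading while scoring poorly on grounding, which is the kind of gap the abstract describes for zero-shot models.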

Problem

Research questions and friction points this paper is trying to address.

clinical document understanding

vision-language models

evidence grounding

cancer referral triage

multimodal extraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models

Evidence Grounding

Clinical Document Understanding