🤖 AI Summary
Existing medical image segmentation methods rely heavily on precise bounding boxes or domain-specific textual prompts, exhibiting poor generalizability to natural, non-expert user queries. Method: We propose Medical Reasoning Segmentation and Detection (MedSD), a novel task requiring pixel-level segmentation and object localization solely from colloquial, logic-implicit natural language queries. To this end, we formally define the MedSD task; introduce MLMR-SD, a multi-perspective logic-driven dataset enabling annotation-free, box-free, and terminology-free input; and design MediSee, an end-to-end baseline model integrating vision-language understanding, logical reasoning modeling, and multi-granularity localization decoding, augmented by an implicit-reasoning prompt-guided cross-modal alignment mechanism. Contribution/Results: Experiments demonstrate that our method outperforms traditional medical referring segmentation approaches on MedSD, validating its comprehension of lay-user utterances and its accurate pixel-level localization without requiring expert annotations or structured prompts.
📝 Abstract
Despite remarkable advances in pixel-level medical image perception, existing methods are either limited to specific tasks or rely heavily on accurate bounding boxes or text labels as input prompts. The medical knowledge needed to supply such prompts is a major obstacle for the general public and greatly limits the applicability of these methods. Rather than providing such domain-specialized auxiliary information, general users tend to pose colloquial queries whose interpretation requires logical reasoning. In this paper, we introduce a novel medical vision task: Medical Reasoning Segmentation and Detection (MedSD), which aims to comprehend implicit queries about medical images and generate the corresponding segmentation mask and bounding box for the target object. To accomplish this task, we first introduce the Multi-perspective, Logic-driven Medical Reasoning Segmentation and Detection (MLMR-SD) dataset, which encompasses a substantial collection of medical entity targets along with their corresponding reasoning. Furthermore, we propose MediSee, an effective baseline model designed for medical reasoning segmentation and detection. Experimental results indicate that the proposed method effectively addresses MedSD with implicit colloquial queries and outperforms traditional medical referring segmentation methods.