🤖 AI Summary
This work addresses two challenges in nucleus detection for histopathology images — complex post-processing and severe foreground-background imbalance — by reframing the task as sequence prediction of nuclear center coordinates. The proposed method leverages a multimodal large language model to generate coordinate sequences directly from input images, introducing a spatially aware soft-supervision mechanism and a visual chain-of-thought strategy. It further incorporates a reinforcement learning fine-tuning scheme featuring a distribution-matching reward, low-variance group filtering, and fine-grained advantage shaping. Evaluated on nine mainstream benchmark datasets, the approach significantly outperforms existing state-of-the-art methods while demonstrating high efficiency and strong generalization.
📝 Abstract
Nucleus detection in histopathology is pivotal for a wide range of clinical applications. Existing approaches either regress nuclear proxy maps that require complex post-processing, or employ dense anchors or queries that introduce severe foreground-background imbalance. In this work, we reformulate nucleus detection as next-point prediction, wherein a multimodal large language model is developed to directly output foreground nucleus centroids from the input image. The model is trained in two stages. In the supervised learning stage, we propose spatial-aware soft supervision to relax strict centroid matching, and a chain-of-visual-thought strategy to incorporate visual priors that facilitate coordinate prediction. In the reinforcement fine-tuning stage, we design a distribution-matching reward, low-variance group filtering, and fine-grained advantage shaping to further improve the model's detection quality. Extensive experiments on nine widely used benchmarks demonstrate the superiority of our method. Code will be released soon.
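To make the "detection as next-point prediction" reformulation concrete, the sketch below shows one plausible way to serialize nucleus centroids into a flat token sequence and parse a generated sequence back into points. This is a hypothetical illustration: the token format, the `<eos>` terminator, and the function names are assumptions for exposition, not the paper's actual vocabulary or decoding scheme.

```python
# Hypothetical illustration of casting nucleus detection as sequence
# prediction: centroids become a flat token stream that an autoregressive
# model could emit one token at a time. Token format is an assumption.

def centroids_to_sequence(centroids, eos="<eos>"):
    """Flatten (x, y) centroid pairs into a token sequence ending in <eos>."""
    tokens = []
    for x, y in centroids:
        tokens += [str(x), str(y)]
    tokens.append(eos)
    return tokens

def sequence_to_centroids(tokens, eos="<eos>"):
    """Parse a generated token sequence back into (x, y) pairs.

    Reading stops at the <eos> token; a trailing unpaired coordinate
    (from a truncated generation) is dropped.
    """
    coords = []
    for t in tokens:
        if t == eos:
            break
        coords.append(int(t))
    return list(zip(coords[0::2], coords[1::2]))
```

Under this framing, the model predicts only foreground points and stops at `<eos>`, which is how the sequence view sidesteps both dense proxy-map post-processing and anchor/query foreground-background imbalance.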