A Human-in-the-Loop Framework for Efficient Prompt Selection in Microscopy Vision-Language Models

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high annotation cost and scarce samples of microscopy image classification by formulating prompt-set construction for vision-language models (VLMs) as a target-driven active learning problem with human feedback in the loop. Operating on a small pool of unlabeled data, the method studies three complementary active learning criteria for prioritizing the most informative images for expert verification, while the VLM drafts captions for the candidate image–text pairs so that experts only verify and lightly edit them. Experiments show that the approach reaches 100% test accuracy with as few as 20 expert-verified images on average, substantially fewer than random selection requires, enabling efficient human-in-the-loop biomedical image analysis under extremely limited labeling resources.

📝 Abstract

Deep-learning pipelines for microscopy image classification often require expensive, labor- and time-intensive expert annotation to produce high-quality ground truth for training. Recent work has shown that prompt tuning of vision-language models (VLMs) can reduce manual annotation by constructing a small prompt set of expert-verified image-caption exemplars that is reused as few-shot context to classify all remaining images at inference time. To further reduce effort, the VLM can draft captions for candidate exemplars, which experts then verify and lightly edit instead of writing text de novo. However, two practical questions remain unaddressed: (1) which unlabeled images should be prioritized for verification, and (2) how many verified exemplars are needed to reach a performance target. In this work, we address these questions by formulating prompt-set construction as a target-driven active learning problem that prioritizes which images to annotate. We study three complementary selection criteria under strict low-resource constraints with small unlabeled pools. Experiments show that our methods reach the target performance with substantially fewer expert-verified images than random selection, achieving 100% test accuracy with as few as 20 annotated images on average. More broadly, our human-in-the-loop framework demonstrates a human-centered use of generative AI in biomedical image analysis, where experts remain actively involved in verifying and refining model output while significantly reducing annotation cost. Code and data will be publicly available.
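
The target-driven loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not name its three selection criteria, so predictive entropy stands in as one common active learning criterion, and `toy_predict`, `verify`, and the 1-D "images" are hypothetical placeholders for the VLM, the expert, and real microscopy data.

```python
import math
import random

def entropy(probs):
    """Shannon entropy of a class-probability dict (natural log)."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def select_next(pool, predict, prompt_set):
    """Assumed criterion: pick the pool item with the most uncertain prediction."""
    return max(pool, key=lambda x: entropy(predict(x, prompt_set)))

def build_prompt_set(pool, predict, verify, evaluate, target, budget):
    """Add expert-verified exemplars until the target score or the budget is hit."""
    pool = list(pool)
    prompt_set = []
    while pool and len(prompt_set) < budget:
        if evaluate(prompt_set) >= target:
            break
        item = select_next(pool, predict, prompt_set)
        pool.remove(item)
        # verify() stands in for the expert checking a drafted caption/label.
        prompt_set.append((item, verify(item)))
    return prompt_set

# Hypothetical toy setting: "images" are floats in [0, 1], class A if x < 0.5.
def true_label(x):
    return "A" if x < 0.5 else "B"

def toy_predict(x, prompt_set):
    """Stand-in for a few-shot VLM: soft nearest-exemplar classifier."""
    labels = ["A", "B"]
    dists = {}
    for lab in labels:
        ds = [abs(x - ex) for ex, ex_lab in prompt_set if ex_lab == lab]
        if not ds:
            # No exemplar for some class yet: fall back to a uniform guess.
            return {name: 1.0 / len(labels) for name in labels}
        dists[lab] = min(ds)
    weights = {lab: math.exp(-10.0 * d) for lab, d in dists.items()}
    total = sum(weights.values())
    return {lab: w / total for lab, w in weights.items()}

def accuracy(prompt_set, val):
    """Fraction of validation items whose top predicted class is correct."""
    hits = 0
    for x, truth in val:
        probs = toy_predict(x, prompt_set)
        hits += max(probs, key=probs.get) == truth
    return hits / len(val)

def run_demo(target=1.0, budget=20, seed=0):
    pool = [i / 29 for i in range(30)]
    random.Random(seed).shuffle(pool)
    val = [(j / 99, true_label(j / 99)) for j in range(100)]
    evaluate = lambda ps: accuracy(ps, val)
    prompt_set = build_prompt_set(pool, toy_predict, true_label,
                                  evaluate, target, budget)
    return prompt_set, evaluate(prompt_set)

if __name__ == "__main__":
    ps, acc = run_demo()
    print(f"verified exemplars: {len(ps)}, validation accuracy: {acc:.2f}")
```

The stopping rule addresses the abstract's second question (how many verified exemplars are needed) by checking the target before each new verification, while `select_next` addresses the first (which image to verify next). Swapping in a different criterion only requires replacing `select_next`.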

Problem

Research questions and friction points this paper is trying to address.

prompt selection

vision-language models

active learning

microscopy image classification

human-in-the-loop

Innovation

Methods, ideas, or system contributions that make the work stand out.

human-in-the-loop

prompt selection

vision-language models