🤖 AI Summary
Cell image segmentation is difficult because imaging modalities are heterogeneous, cell morphology is diverse, and annotated data are severely scarce. To address these challenges, we propose the first training-free multi-agent segmentation framework with long-term memory, which operates as a closed planning-execution-evaluation loop. It dynamically dispatches domain-specific tools, adapts across imaging modalities, supports text-guided segmentation of organelles such as the Golgi apparatus, and preserves expert knowledge in memory. The framework combines large language model agents, vision-language models, on-demand dispatching of segmentation models, zero-shot adaptation from reference images, and human-in-the-loop feedback. Across four benchmarks, it improves mean accuracy by 15.7% over state-of-the-art baselines, and it improves average IoU by 37.6% over specialist models for mitochondria and endoplasmic reticulum segmentation on new datasets. It achieves these gains without retraining, which also reduces annotation cost.
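For readers unfamiliar with the metric, IoU (intersection over union) for a binary mask can be computed as in the minimal sketch below. It assumes masks are given as equally sized nested lists of 0/1 values; the function name `iou` and the convention of returning 1.0 when both masks are empty are illustrative choices, not details taken from the paper.

```python
from typing import List

Mask = List[List[int]]  # binary mask as nested lists of 0/1 (illustrative format)


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of identical shape."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1

    # Assumption: two empty masks count as a perfect match.
    return intersection / union if union else 1.0
```

A mask that overlaps the target on one pixel out of three covered in total scores 1/3.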
📝 Abstract
Cellular image segmentation is essential for quantitative biology yet remains difficult due to heterogeneous modalities, morphological variability, and limited annotations. We present GenCellAgent, a training-free multi-agent framework that orchestrates specialist segmenters and generalist vision-language models via a planner-executor-evaluator loop (choose tool $\rightarrow$ run $\rightarrow$ quality-check) with long-term memory. The system (i) automatically routes images to the best tool, (ii) adapts on the fly using a few reference images when imaging conditions differ from what a tool expects, (iii) supports text-guided segmentation of organelles not covered by existing models, and (iv) commits expert edits to memory, enabling self-evolution and personalized workflows. Across four cell-segmentation benchmarks, this routing yields a 15.7% mean accuracy gain over state-of-the-art baselines. On endoplasmic reticulum and mitochondria from new datasets, GenCellAgent improves average IoU by 37.6% over specialist models. It also segments novel objects such as the Golgi apparatus via iterative text-guided refinement, with light human correction further boosting performance. Together, these capabilities provide a practical path to robust, adaptable cellular image segmentation without retraining, while reducing annotation burden and matching user preferences.
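The choose-tool, run, quality-check loop with memory can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GenCellAgent's actual implementation: the names `Memory`, `plan`, and `segment_with_loop`, the numeric quality score, the 0.5 threshold, and the policy of trying tools one at a time are all hypothetical. In the real system the planner and evaluator would be LLM or vision-language agents; here they are plain Python callables.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

Mask = Any                               # whatever a segmenter returns (illustrative)
Tool = Callable[[Any], Mask]             # hypothetical segmenter: image -> mask
Evaluator = Callable[[Any, Mask], float] # hypothetical quality check: higher is better


@dataclass
class Memory:
    """Hypothetical long-term memory: the tool that last passed, per image type."""
    preferred: Dict[str, str] = field(default_factory=dict)


def plan(image_type: str, tools: Dict[str, Tool], memory: Memory,
         tried: List[str]) -> Optional[str]:
    """Planner: prefer the remembered tool, otherwise the next untried one."""
    remembered = memory.preferred.get(image_type)
    if remembered in tools and remembered not in tried:
        return remembered
    for name in tools:
        if name not in tried:
            return name
    return None  # every tool has been tried


def segment_with_loop(image: Any, image_type: str, tools: Dict[str, Tool],
                      evaluate: Evaluator, memory: Memory,
                      threshold: float = 0.5) -> Tuple[float, str, Mask]:
    """Run plan -> execute -> evaluate until a result passes or tools run out."""
    if not tools:
        raise ValueError("at least one tool is required")
    tried: List[str] = []
    best: Optional[Tuple[float, str, Mask]] = None
    while True:
        name = plan(image_type, tools, memory, tried)
        if name is None:
            break
        mask = tools[name](image)       # executor
        score = evaluate(image, mask)   # evaluator
        tried.append(name)
        if best is None or score > best[0]:
            best = (score, name, mask)
        if score >= threshold:
            memory.preferred[image_type] = name  # commit the success to memory
            break
    assert best is not None  # guaranteed because tools is non-empty
    return best
```

Because a passing tool is written to memory, a later image of the same type is routed to that tool first instead of repeating the search, which is one simple way to read the "self-evolution" behaviour described in item (iv).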