MorphoHELM: A Comprehensive Benchmark for Evaluating Representations for Microscopy-Based Morphology Assays

📅 2026-05-14

🤖 AI Summary

This work addresses the lack of a standardized evaluation framework for cellular microscopy image representations, which has hindered fair comparisons across methods. To this end, we introduce MorphoHELM, an open benchmark tailored to Cell Painting morphological profiling that consolidates and refines existing evaluation protocols. The benchmark evaluates each task under controlled levels of technical noise, quantifying how robust each representation method is to batch effects and how well it captures biologically relevant signal. Through a comprehensive evaluation of diverse deep learning and classical computer vision approaches on a unified dataset, we find that no single model consistently outperforms traditional handcrafted features, which remain the most generally effective strategy. All code, data, and tools are publicly released to foster reproducibility and further research.

📝 Abstract

Microscopy images contain rich information about how cells respond to perturbations, making them essential to applications like drug screening. To quantify images, researchers often use representation extraction methods, and recent years have seen a proliferation of deep learning methods. While measuring the quality of these representations is essential, evaluation remains fragmented: each proposed model is evaluated on different tasks and datasets, using custom pipelines and metrics, which makes it difficult to compare models fairly. Here, we introduce MorphoHELM, a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting, the most widely used morphological profiling assay. MorphoHELM consolidates evaluation standards in the field, extends and corrects them to be more robust, and evaluates the widest range of methods to date. A defining feature of the benchmark is that each task is evaluated at different degrees of batch effects (or technical noise), directly quantifying how the ability of methods to detect biological signal degrades as noise increases. Together, these properties enable MorphoHELM to detect trade-offs between methods, and we demonstrate that models that excel at certain kinds of biological signal are weaker at others. We show that no existing model outperforms classic computer vision analytic strategies in every setting, and that these classic strategies remain the strongest general-purpose representations. All datasets, code, and evaluation tools are publicly available at https://github.com/microsoft/MorphoHELM.
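The idea of scoring a representation at increasing degrees of batch effect can be illustrated with a small synthetic sketch. This is not MorphoHELM's protocol, data, or API: the function names (`make_profiles`, `cross_batch_1nn_accuracy`, `degradation_curve`), the Gaussian toy profiles, and the choice of cross-batch nearest-neighbor retrieval as the metric are all assumptions made for illustration. Each perturbation gets a fixed "biological" center, each batch gets an additive offset scaled by `batch_scale`, and the score is the fraction of profiles whose nearest neighbor from a different batch shares their perturbation label.

```python
import numpy as np


def make_profiles(n_perturbations=5, n_batches=3, n_reps=4, dim=16,
                  batch_scale=0.0, seed=0):
    """Toy profiles: perturbation center + scaled batch offset + small noise.

    Hypothetical stand-in for real embeddings; not taken from the benchmark.
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(n_perturbations, dim))           # biological signal
    offsets = rng.normal(size=(n_batches, dim)) * batch_scale   # technical noise
    X, labels, batches = [], [], []
    for b in range(n_batches):
        for p in range(n_perturbations):
            for _ in range(n_reps):
                X.append(centers[p] + offsets[b] + 0.1 * rng.normal(size=dim))
                labels.append(p)
                batches.append(b)
    return np.array(X), np.array(labels), np.array(batches)


def cross_batch_1nn_accuracy(X, labels, batches):
    """Fraction of profiles whose nearest other-batch neighbor has the same label."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Mask same-batch pairs (including self) so matches must cross batches.
    dist[batches[:, None] == batches[None, :]] = np.inf
    nearest = dist.argmin(axis=1)
    return float((labels[nearest] == labels).mean())


def degradation_curve(scales=(0.0, 0.5, 1.0, 2.0, 4.0)):
    """Score the same toy representation at each batch-effect strength."""
    return {s: cross_batch_1nn_accuracy(*make_profiles(batch_scale=s))
            for s in scales}


if __name__ == "__main__":
    for scale, acc in degradation_curve().items():
        print(f"batch_scale={scale:>4}: cross-batch 1-NN accuracy = {acc:.2f}")
```

Because the seed is fixed, the perturbation centers are identical at every scale, so any change in the score comes only from the strength of the batch offsets. Comparing such curves across representations is one way to see the kind of trade-off the abstract describes.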

Problem

Research questions and friction points this paper is trying to address.

morphology assays

representation evaluation

benchmarking

batch effects

Cell Painting

Innovation

Methods, ideas, or system contributions that make the work stand out.

MorphoHELM

representation evaluation

batch effect robustness