PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a gap in existing plant marker gene resources: they do not explicitly model the supporting evidence found in the scientific literature, which makes it hard to evaluate how well language models interpret that evidence. The authors introduce PlantMarkerBench, a multi-species benchmark of 5,550 sentence-level evidence instances across four plant species (Arabidopsis, maize, rice, and tomato), annotated for marker-evidence validity, evidence type, and support strength. It defines two tasks: deciding whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying that evidence as expression, localization, function, indirect, or negative. The benchmark is built with a modular curation pipeline that combines large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review, and it is used to evaluate open-weight and closed-source models across species and prompting strategies. Frontier models perform relatively well on direct expression evidence, but performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models also show elevated false-positive rates in ambiguous biological contexts.

📝 Abstract

Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.
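The two tasks described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code or released schema: the `EvidenceInstance` field names, the helper functions, and the choice of metrics (false-positive rate for validity, per-type accuracy and a confusion count for evidence type) are assumptions made here for illustration. Only the five category names come from the abstract, and the labels in the usage section are toy values, not benchmark data.

```python
from collections import Counter
from dataclasses import dataclass

# The five evidence categories named in the abstract.
EVIDENCE_TYPES = ("expression", "localization", "function", "indirect", "negative")


@dataclass(frozen=True)
class EvidenceInstance:
    """One sentence-level instance. All field names are hypothetical."""
    species: str
    gene: str
    cell_type: str
    sentence: str
    is_valid: bool       # Task 1 label: valid marker evidence for (gene, cell_type)?
    evidence_type: str   # Task 2 label: one of EVIDENCE_TYPES


def false_positive_rate(gold, pred):
    """Task 1: fraction of gold-invalid sentences that were predicted valid."""
    preds_on_negatives = [p for g, p in zip(gold, pred) if not g]
    if not preds_on_negatives:
        return 0.0
    return sum(preds_on_negatives) / len(preds_on_negatives)


def per_type_accuracy(gold, pred):
    """Task 2: accuracy computed separately for each gold evidence type."""
    total, correct = Counter(), Counter()
    for g, p in zip(gold, pred):
        total[g] += 1
        correct[g] += int(g == p)
    return {t: correct[t] / total[t] for t in total}


def confusion_counts(gold, pred):
    """Task 2: count each (gold, predicted) pair where the type was wrong."""
    return Counter((g, p) for g, p in zip(gold, pred) if g != p)


if __name__ == "__main__":
    # Toy labels for illustration only.
    gold_valid = [True, False, False, True]
    pred_valid = [True, True, False, True]
    gold_type = ["expression", "function", "function", "indirect"]
    pred_type = ["expression", "expression", "function", "indirect"]

    print(false_positive_rate(gold_valid, pred_valid))
    print(per_type_accuracy(gold_type, pred_type))
    print(confusion_counts(gold_type, pred_type))
```

Reporting accuracy per evidence type, rather than one pooled score, is what would make a gap between expression evidence and functional or indirect evidence visible. The confusion count shows which categories are mistaken for which.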

Problem

Research questions and friction points this paper is trying to address.

plant marker genes

evidence interpretation

scientific literature

cell-type specificity

biological evidence

Innovation

Methods, ideas, or system contributions that make the work stand out.

PlantMarkerBench

evidence-grounded reasoning

multi-species benchmark