🤖 AI Summary
Plant morphological trait data are scarce, and manual curation is labor-intensive and time-consuming. Method: This paper proposes a large language model (LLM)-driven information extraction approach that requires neither manual annotation nor domain-specific rules. Leveraging prompt engineering and in-context learning, the method integrates multi-source web text retrieval with standardized post-processing to extract species–trait relationships. Contribution/Results: It is the first systematic validation of LLMs for zero-shot, cross-source species–trait relation extraction. The method avoids the traditional reliance on structured databases or handcrafted rules. When used to replicate three manually curated species–trait matrices, it found values for over 50% of species–trait pairs, with an F1-score exceeding 75%. This offers a scalable, low-barrier pathway for constructing large-scale, structured plant trait databases.
📝 Abstract
Plant morphological traits, their observable characteristics, are fundamental to understanding the role played by each species within its ecosystem. However, compiling trait information for even a moderate number of species is a demanding task that may take experts years to accomplish. At the same time, a massive amount of information about species is available online in the form of textual descriptions, although the lack of structure makes this source of data impossible to use at scale. To overcome this, we propose to leverage recent advances in large language models (LLMs) and devise a mechanism for gathering and processing information on plant traits in the form of unstructured textual descriptions, without manual curation. We evaluate our approach by automatically replicating three manually created species-trait matrices. Our method found values for over half of all species-trait pairs, with an F1-score of over 75%. Our results suggest that large-scale creation of structured trait databases from unstructured online text is currently feasible thanks to the information extraction capabilities of LLMs, although it remains limited by the availability of textual descriptions covering all the traits of interest.
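The two figures reported above (the share of species-trait pairs with a value, and the F1-score) can be illustrated with a minimal sketch. The abstract does not specify how either metric is computed, so everything here is an assumption: the function name `coverage_and_f1`, the binary present/absent encoding, the use of `None` for pairs where no value was extracted, and the choice to compute F1 only over covered pairs are all hypothetical, and the data are toy values invented for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key format: (species name, trait value name).
Pair = Tuple[str, str]


def coverage_and_f1(
    reference: Dict[Pair, int],
    predicted: Dict[Pair, Optional[int]],
) -> Tuple[float, float]:
    """Return (coverage, F1) of a predicted matrix against a reference.

    Assumptions (not taken from the paper): entries are binary
    (1 = trait present, 0 = absent), None means no value was extracted,
    and F1 is computed for the "present" class over covered pairs only.
    """
    # A pair is covered when the prediction holds an actual value.
    covered = [p for p in reference if predicted.get(p) is not None]
    coverage = len(covered) / len(reference) if reference else 0.0

    tp = sum(1 for p in covered if reference[p] == 1 and predicted[p] == 1)
    fp = sum(1 for p in covered if reference[p] == 0 and predicted[p] == 1)
    fn = sum(1 for p in covered if reference[p] == 1 and predicted[p] == 0)

    # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return coverage, f1


# Toy matrices with placeholder species and traits.
reference = {
    ("species_a", "leaf_lobed"): 1,
    ("species_a", "leaf_compound"): 0,
    ("species_b", "leaf_lobed"): 1,
    ("species_b", "leaf_compound"): 0,
}
predicted = {
    ("species_a", "leaf_lobed"): 1,
    ("species_a", "leaf_compound"): 1,  # wrong: predicted present
    ("species_b", "leaf_lobed"): None,  # no value extracted
    ("species_b", "leaf_compound"): 0,
}

coverage, f1 = coverage_and_f1(reference, predicted)
print(f"coverage={coverage:.2f}, F1={f1:.2f}")
```

Reporting coverage separately from F1 reflects the limitation noted in the abstract: a pair can be missing simply because no textual description mentions the trait, which is a different failure from extracting a wrong value.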