🤖 AI Summary
This study addresses the lack of domain-specific benchmarks for misinformation detection in life sciences. We introduce FSoLS, the first four-class misinformation dataset tailored to this domain—comprising 2,603 annotated texts spanning 14 scientific topics, 17 source types, and 4 publication formats—thereby bridging gaps in stylistic diversity and systematic annotation. Methodologically, we integrate large language models with traditional machine learning, incorporating linguistically grounded and rhetorical features—particularly those capturing emotional appeal and attention-grabbing patterns—to model information quality. Empirical evaluation confirms the discriminative power of our proposed features in distinguishing credible from misleading content. To foster reproducibility and community advancement, we fully open-source the dataset, annotation guidelines, and data collection code, enabling fine-grained, interpretable misinformation detection research in life sciences.
📝 Abstract
Disseminators of disinformation often seek to attract attention or evoke emotions, typically to gain influence or generate revenue, resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact-checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating and updating the dataset is available on GitHub: https://github.com/EvaSeidlmayer/FourShadesofLifeSciences