🤖 AI Summary
This study addresses the lack of domain-specific benchmarks for misinformation detection in life sciences. We introduce FSoLS, the first four-class misinformation dataset tailored to this domain—comprising 2,603 annotated texts spanning 14 scientific topics, 17 source types, and 4 publication formats—thereby bridging gaps in stylistic diversity and systematic annotation. Methodologically, we integrate large language models with traditional machine learning, incorporating linguistically grounded and rhetorical features—particularly those capturing emotional appeal and attention-grabbing patterns—to model information quality. Empirical evaluation confirms the discriminative power of our proposed features in distinguishing credible from misleading content. To foster reproducibility and community advancement, we fully open-source the dataset, annotation guidelines, and data collection code, enabling fine-grained, interpretable misinformation detection research in life sciences.
📝 Abstract
Disseminators of disinformation often seek to attract attention or evoke emotions, typically to gain influence or generate revenue, resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact-checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating and updating the dataset is available on GitHub: https://github.com/EvaSeidlmayer/FourShadesofLifeSciences