Affective-ROPTester: Capability and Bias Analysis of LLMs in Predicting Retinopathy of Prematurity

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two gaps in applying large language models (LLMs) to retinopathy of prematurity (ROP) risk stratification: limited predictive accuracy and unexamined affective bias. To this end, we introduce CROP, a novel Chinese clinical text benchmark dataset for ROP, and propose Affective-ROPTester, an automated evaluation framework that combines instruction-based prompting, chain-of-thought reasoning, in-context learning, and affective prompting. Affective prompts serve as a bias analysis tool: experiments show that LLMs tend to overestimate medium- and high-risk cases, and that positive emotional framing, compared with negative framing, helps mitigate this bias. Supplying external medical knowledge through chain-of-thought and in-context prompts further improves predictive performance. These findings indicate that affect-sensitive prompt engineering influences the reliability of clinical AI systems and that emotion-aware prompt design deserves attention when medical LLMs are evaluated and deployed.

📝 Abstract
Despite the remarkable progress of large language models (LLMs) across various domains, their capacity to predict retinopathy of prematurity (ROP) risk remains largely unexplored. To address this gap, we introduce a novel Chinese benchmark dataset, termed CROP, comprising 993 admission records annotated with low, medium, and high-risk labels. To systematically examine the predictive capabilities and affective biases of LLMs in ROP risk stratification, we propose Affective-ROPTester, an automated evaluation framework incorporating three prompting strategies: Instruction-based, Chain-of-Thought (CoT), and In-Context Learning (ICL). The Instruction scheme assesses LLMs' intrinsic knowledge and associated biases, whereas the CoT and ICL schemes leverage external medical knowledge to enhance predictive accuracy. Crucially, we integrate emotional elements at the prompt level to investigate how different affective framings influence the model's ability to predict ROP and its bias patterns. Empirical results derived from the CROP dataset yield three principal observations. First, LLMs demonstrate limited efficacy in ROP risk prediction when operating solely on intrinsic knowledge, yet exhibit marked performance gains when augmented with structured external inputs. Second, affective biases are evident in the model outputs, with a consistent inclination toward overestimating medium- and high-risk cases. Third, compared to negative emotions, positive emotional framing contributes to mitigating predictive bias in model outputs. These findings highlight the critical role of affect-sensitive prompt engineering in enhancing diagnostic reliability and emphasize the utility of Affective-ROPTester as a framework for evaluating and mitigating affective bias in clinical language modeling systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' capability in predicting retinopathy of prematurity risk
Analyzing affective biases in LLMs for ROP risk stratification
Assessing impact of emotional framing on predictive accuracy and bias
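
The overestimation bias raised in these questions can be quantified by treating the three risk labels as ordered and counting how often a prediction lands above or below the annotated label. The following is a minimal sketch, not the paper's evaluation code: the function name `bias_summary`, the ordinal encoding, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter

# Ordinal encoding of the three risk labels named in the abstract.
# The numeric values are an assumption used only for comparison.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}


def bias_summary(gold, pred):
    """Return the share of correct, over-estimated and under-estimated cases."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    counts = Counter()
    for g, p in zip(gold, pred):
        diff = RISK_ORDER[p] - RISK_ORDER[g]
        if diff > 0:
            counts["over"] += 1      # predicted risk is higher than annotated
        elif diff < 0:
            counts["under"] += 1     # predicted risk is lower than annotated
        else:
            counts["correct"] += 1
    n = len(gold)
    return {key: counts[key] / n for key in ("correct", "over", "under")}


# Toy labels invented for illustration; these are not CROP data.
gold = ["low", "low", "medium", "high"]
pred = ["medium", "low", "high", "high"]
summary = bias_summary(gold, pred)
print(summary)
```

Comparing the "over" share across positive, neutral, and negative prompt framings would be one simple way to see whether framing shifts the bias.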
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CROP dataset for ROP risk prediction
Uses Instruction, CoT, and ICL prompting strategies
Integrates emotional elements at the prompt level to analyze and mitigate affective bias
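
How the three prompting strategies and an affective framing might combine into a single prompt can be sketched as below. This is a hypothetical illustration under stated assumptions: the function `build_prompt`, the wording of the affective prefixes, the task sentence, and the example records are invented here and are not the prompts used in the paper.

```python
# Hypothetical affective prefixes; the paper's actual wording is not reproduced here.
AFFECT_PREFIX = {
    "neutral": "",
    "positive": "I feel calm and hopeful about this baby's progress. ",
    "negative": "I am extremely worried about this baby. ",
}

# Assumed task sentence covering the three labels named in the abstract.
TASK = ("Classify the retinopathy of prematurity (ROP) risk of this "
        "admission record as low, medium, or high.")


def build_prompt(record, strategy="instruction", affect="neutral",
                 knowledge="", examples=()):
    """Assemble a prompt for one record under a given strategy and framing."""
    if strategy not in ("instruction", "cot", "icl"):
        raise ValueError(f"unknown strategy: {strategy}")
    parts = [AFFECT_PREFIX[affect] + TASK]
    if strategy == "cot":
        # CoT: optionally supply external knowledge, then ask for reasoning.
        if knowledge:
            parts.append("Reference knowledge: " + knowledge)
        parts.append("Reason step by step before giving the final label.")
    elif strategy == "icl":
        # ICL: prepend labelled demonstrations as (record, label) pairs.
        for ex_record, ex_label in examples:
            parts.append(f"Record: {ex_record}\nRisk: {ex_label}")
    parts.append(f"Record: {record}\nRisk:")
    return "\n\n".join(parts)


# Invented records for illustration only.
prompt = build_prompt(
    "Gestational age 28 weeks.",
    strategy="icl",
    affect="positive",
    examples=[("Gestational age 35 weeks.", "low")],
)
print(prompt)
```

Keeping the framing as a prefix that is independent of the strategy makes it possible to hold the clinical content fixed while varying only the emotional tone.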
Shuai Zhao
College of Computing and Data Science, Nanyang Technological University, Singapore, 639798

Yulin Zhang
Department of Ophthalmology, Huizhou First Hospital, Huizhou, Guangdong, China, 516000

Luwei Xiao
Nanyang Technological University
LLMs, Multimodal Interaction, Sentiment Analysis, Human-in-the-loop, AI for Healthcare

Xinyi Wu
School of Education, Shanghai Jiao Tong University, Shanghai, China, 200240

Yanhao Jia
Nanyang Technological University
Artificial Intelligence, Deep Learning, Computational Neuroscience

Zhongliang Guo
University of St Andrews
Computer Vision, Adversarial Attack, Adversarial Samples, Trustworthy AI

Xiaobao Wu
Research Scientist, Nanyang Technological University
Large Language Models, Machine Learning, Natural Language Processing

Cong-Duy Nguyen
College of Computing and Data Science, Nanyang Technological University, Singapore, 639798

Guoming Zhang
Shenzhen Eye Hospital, Shenzhen Eye Medical Center, Southern Medical University, Shenzhen, China, 518040

Anh Tuan Luu
College of Computing and Data Science, Nanyang Technological University, Singapore, 639798