Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

Date: 2026-05-11

AI Summary
This study examines central tendency bias in multimodal large language models (MLLMs) used for clinical ordinal scoring, in particular their difficulty assigning extreme scores in cognitive impairment screening. Three frontier MLLM families are benchmarked against supervised deep learning models on Clock Drawing Test (CDT) scoring, using two public datasets and the Shulman rubric. Fully fine-tuned Vision Transformers achieve the best calibration (MAE = 0.52, within-1 accuracy = 91%). Zero-shot MLLMs remain competitive on within-1 accuracy (GPT-5 reaches 92%, with MAE = 0.67) but consistently compress predictions toward the center of the scale, over-predicting at the low end and under-predicting at the high end. Neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates this bias. The work extends LLM-as-a-judge bias research from NLP evaluation to clinical assessment, where errors at the scale extremes matter most for screening decisions.
Abstract
Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This effect disproportionately affects the clinically critical extremes where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
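The metrics named in the abstract can be illustrated with a minimal sketch. It assumes integer scores on the 0 to 5 scale mentioned above, and the function names and toy data are hypothetical, not taken from the paper. It computes MAE, within-1 accuracy, and the mean signed error per true score, which is one simple way to make endpoint compression visible.

```python
from collections import defaultdict


def ordinal_metrics(y_true, y_pred):
    """Return (MAE, within-1 accuracy) for paired integer ordinal scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    abs_errors = [abs(p - t) for t, p in zip(y_true, y_pred)]
    mae = sum(abs_errors) / len(abs_errors)
    within1 = sum(e <= 1 for e in abs_errors) / len(abs_errors)
    return mae, within1


def per_score_bias(y_true, y_pred):
    """Mean signed error (prediction minus truth), grouped by true score.

    Positive values at the low end and negative values at the high end
    indicate compression toward the middle of the scale.
    """
    grouped = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        grouped[t].append(p - t)
    return {score: sum(errs) / len(errs) for score, errs in sorted(grouped.items())}


# Toy data (invented for illustration): extremes are pulled one step inward.
y_true = [0, 0, 1, 2, 3, 4, 5, 5]
y_pred = [1, 1, 1, 2, 3, 4, 4, 4]

mae, within1 = ordinal_metrics(y_true, y_pred)
bias = per_score_bias(y_true, y_pred)
print(f"MAE={mae:.2f}, within-1={within1:.0%}")
print(bias)
```

On this toy data every prediction falls within one point of the true score, so within-1 accuracy is perfect even though every score of 0 and 5 is mispredicted. This is why a per-score breakdown is needed alongside tolerance-based agreement.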
Problem

Research questions and friction points this paper is trying to address.

central tendency bias
multimodal LLMs
clinical ordinal scoring
Clock Drawing Test
automated evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

central tendency bias
multimodal LLMs
clinical ordinal scoring
calibration
Clock Drawing Test
Jiaqing Zhang
University of Science and Technology of China
Recommender System, Data-Centric AI
Sandeep Elluri
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611
Bhanu Cherukuvada
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611
Yonah Joffe
Department of Clinical and Health Psychology, University of Florida, Gainesville, FL 32611
Jessica Sena
Postdoc in Biomedical Engineering, University of Florida
Medical AI, Machine Learning for Health, Medical Artificial Intelligence, Digital Health
Miguel Contreras
Department of Biomedical Engineering, University of Florida, Gainesville, FL 32611
Scott Siegel
PhD Student, University of Florida
Machine Learning
Subhash Nerella
Department of Biomedical Engineering, University of Florida, Gainesville, FL 32611
Catherine E. Price
Department of Clinical and Health Psychology, University of Florida, Gainesville, FL 32611
Parisa Rashidi
University of Florida
Machine Learning for Health, Medical Artificial Intelligence, Medical AI, Digital Health