🤖 AI Summary
This study addresses the challenge of predicting clinically meaningful postoperative improvement in chronic rhinosinusitis patients using preoperative structured clinical data, with the goal of avoiding ineffective surgeries. We present the first systematic comparison between supervised learning models—logistic regression, tree ensembles, and multilayer perceptrons—and leading generative AI systems (ChatGPT, Claude, Gemini, Perplexity) on a real-world clinical decision task, using identical structured inputs and constraining outputs to binary recommendations with confidence scores. The best-performing multilayer perceptron achieved 85% accuracy, demonstrating superior calibration and net benefit on decision curve analysis compared to all generative AI models. Although generative AI exhibited suboptimal predictive performance, its reasoning aligned closely with clinical expertise and feature importance rankings. We propose an interpretable clinical workflow that prioritizes machine learning for prediction while leveraging generative AI for explanatory support, offering a novel paradigm for precision preoperative assessment.
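The decision-curve net benefit used to compare the MLP against the generative AI models can be sketched in a few lines. This is a minimal illustration of the standard net-benefit formula, not the study's code; the function name and toy data are hypothetical:

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit at probability threshold t:
    NB(t) = TP/N - (FP/N) * t / (1 - t).
    True positives are traded against false positives,
    weighted by the odds implied by the threshold."""
    n = len(y_true)
    predicted_pos = [p >= threshold for p in y_prob]
    tp = sum(1 for y, pred in zip(y_true, predicted_pos) if y == 1 and pred)
    fp = sum(1 for y, pred in zip(y_true, predicted_pos) if y == 0 and pred)
    return tp / n - (fp / n) * threshold / (1 - threshold)

# Toy example: 4 patients, well-separated predictions, threshold 0.5
print(net_benefit([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1], 0.5))  # → 0.5
```

Sweeping `threshold` over a clinically plausible range and plotting `net_benefit` for each model yields the decision curve; the model whose curve sits highest offers the greatest net benefit at that threshold.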
📝 Abstract
Artificial intelligence has reshaped medical imaging, yet the use of AI on structured clinical data for prospective decision support remains limited. We study preoperative prediction of clinically meaningful improvement in chronic rhinosinusitis (CRS), defining success as a reduction of more than 8.9 points in SNOT-22 at 6 months (the minimal clinically important difference, MCID). In a prospectively collected cohort in which all patients underwent surgery, we ask whether models using only preoperative clinical data could have identified those who would have poor outcomes, i.e., those who should have avoided surgery. We benchmark supervised ML (logistic regression, tree ensembles, and an in-house MLP) against generative AI (ChatGPT, Claude, Gemini, Perplexity), giving each the same structured inputs and constraining outputs to binary recommendations with confidence scores. Our best ML model (MLP) achieves 85% accuracy with superior calibration and decision-curve net benefit. GenAI models underperform on discrimination and calibration in the zero-shot setting. Notably, GenAI justifications align with clinician heuristics and the MLP's feature importance, repeatedly highlighting baseline SNOT-22, CT/endoscopy severity, polyp phenotype, and psychological/pain comorbidities. We provide a reproducible tabular-to-GenAI evaluation protocol and subgroup analyses. Findings support an ML-first, GenAI-augmented workflow: deploy calibrated ML for primary triage of surgical candidacy, with GenAI as an explainer to enhance transparency and shared decision-making.
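The tabular-to-GenAI protocol described above can be sketched as follows: serialize one patient's structured features into a prompt that constrains the model to a binary recommendation plus a confidence score, then parse the constrained reply. This is a hedged illustration, not the study's code; the feature names, prompt wording, and JSON response schema are all assumptions.

```python
import json

def build_prompt(features: dict) -> str:
    """Serialize one patient's structured preoperative features into a
    prompt that constrains the model to a binary recommendation and a
    confidence score, returned as JSON."""
    rows = "\n".join(f"- {name}: {value}" for name, value in features.items())
    return (
        "You are assisting with preoperative assessment for chronic "
        "rhinosinusitis surgery.\n"
        f"Patient features:\n{rows}\n"
        "Will this patient achieve a >8.9-point SNOT-22 reduction at 6 months?\n"
        'Answer ONLY with JSON: {"recommend_surgery": true|false, '
        '"confidence": <0.0-1.0>}'
    )

def parse_response(text: str) -> tuple[bool, float]:
    """Parse the constrained JSON reply into (recommendation, confidence)."""
    obj = json.loads(text)
    return bool(obj["recommend_surgery"]), float(obj["confidence"])

# Hypothetical patient (illustrative feature names, not the study's schema)
prompt = build_prompt({"baseline_SNOT22": 52, "CT_severity": 14, "polyps": "yes"})
rec, conf = parse_response('{"recommend_surgery": true, "confidence": 0.7}')
```

Because every model receives the identical serialized input and must emit the same JSON schema, the parsed `(rec, conf)` pairs can be scored with the same discrimination and calibration metrics applied to the supervised ML models.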