🤖 AI Summary
Existing AI-based clinical note evaluation methods suffer from misalignment between automated metrics and physician preferences, as well as from the subjectivity and limited scalability of expert review.
Method: We propose Feedback2Checklist, a novel framework that distills large-scale, de-identified, real-world clinical feedback into structured, interpretable, and actionable evaluation checklists and builds an LLM-powered automated evaluator to enforce them (see the sketch below).
Contribution/Results: Our approach significantly improves agreement between automated scores and physician preferences (+32.7% Spearman correlation) while demonstrating high coverage, diversity, and robustness to quality degradation. Offline experiments show that it outperforms baselines at identifying low-quality clinical notes while maintaining strong clinical alignment and practical deployability.
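To make the evaluator concrete, here is a minimal sketch of checklist-based LLM scoring. It assumes a generic chat-completion callable (`complete` is a hypothetical stand-in supplied by the caller), and the checklist items are illustrative placeholders, not the actual criteria distilled by Feedback2Checklist.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    criterion: str  # one atomic, feedback-derived quality requirement


# Illustrative items only; the real checklist is distilled from clinician feedback.
CHECKLIST = [
    ChecklistItem("The HPI states the symptom onset and duration mentioned in the encounter."),
    ChecklistItem("The plan lists only medications actually discussed with the patient."),
]


def evaluate_note(note: str, complete) -> float:
    """Return the fraction of checklist items an LLM judge marks as satisfied."""
    passed = 0
    for item in CHECKLIST:
        prompt = (
            "You are auditing an AI-generated clinical note.\n\n"
            f"Note:\n{note}\n\n"
            f"Criterion: {item.criterion}\n"
            "Answer strictly YES or NO."
        )
        # `complete` is a hypothetical prompt -> text function (e.g., an LLM API wrapper).
        if complete(prompt).strip().upper().startswith("YES"):
            passed += 1
    return passed / len(CHECKLIST)
```

Scoring each item with a separate YES/NO judgment keeps the evaluator interpretable: a low overall score can be traced back to the specific feedback-derived criteria that failed.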
📄 Abstract
AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains challenging due to the subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using de-identified data from over 21,000 clinical encounters, prepared in accordance with the HIPAA Safe Harbor standard, from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms baseline approaches in coverage, diversity, and predictive power for human ratings in our offline evaluations. Extensive experiments confirm the checklist's robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, the checklist can help identify notes likely to fall below our chosen quality thresholds.
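As a usage illustration of the alignment measurement, the sketch below computes the Spearman rank correlation between automated checklist scores and physician ratings with `scipy.stats.spearmanr`; the sample values are placeholders, not data from the paper.

```python
from scipy.stats import spearmanr

# Placeholder values for illustration only, not data from the paper.
checklist_scores = [0.90, 0.40, 0.75, 0.20, 0.60]  # automated evaluator outputs
physician_ratings = [5, 2, 4, 1, 3]                # expert quality ratings

rho, p_value = spearmanr(checklist_scores, physician_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

Spearman correlation compares rankings rather than raw values, so it is a natural fit here: what matters is whether the evaluator orders notes by quality the same way physicians do, not whether the score scales match.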