🤖 AI Summary
This study addresses the difficulty of efficiently reviewing the large volumes of interaction data generated by remote cognitive rehabilitation systems, particularly in low-resource clinical settings where reference reports are unavailable and reliable automated report generation methods are lacking. Using identical structured clinical variables as input, it presents a systematic comparison between a knowledge-engineered template system and zero-shot GPT-4 for clinical report generation. The work contributes a replicable methodology for clinical natural language generation that combines expert knowledge elicitation, taxonomy-driven generation, and multi-dimensional human evaluation. Eight speech therapists and final-year students rated the outputs on a nine-criterion questionnaire: the template-based system scored higher on fluency, coherence, and results presentation, whereas GPT-4 produced more concise output. No comparison reached statistical significance after correction, but the direction of the differences was consistent across criteria. Based on evaluator feedback, the authors derive eight design recommendations for the responsible deployment of clinical reporting systems in resource-constrained settings.
📝 Abstract
The growing demand for cognitive remediation therapy, combined with limited speech therapist availability, has accelerated the adoption of remote rehabilitation tools. These systems generate large volumes of interaction data that are difficult for clinicians to review efficiently. This paper investigates automated clinical report generation for avatar-guided, home-based cognitive remediation sessions in a low-resource setting with no reference reports. We present and compare two approaches: (1) a rule-based template system encoding speech therapy domain knowledge as explicit decision rules and validated templates, ensuring clinical reliability and traceability; and (2) a zero-shot LLM-based approach (GPT-4) aimed at more fluent and concise output. Both systems use identical pre-extracted, expert-validated structured variables, enabling a controlled factual comparison. Outputs were evaluated by eight speech therapists and final-year students using a nine-criterion questionnaire. Results reveal a clear trade-off between clinical reliability and linguistic quality. The template-based system scored higher on fluidity, coherence, and results presentation, while GPT-4 produced more concise output. Directional differences are consistent across evaluation dimensions, though no comparison reached statistical significance after correction, reflecting the scale constraints of expert clinical evaluation. Based on evaluator feedback, we derive eight design recommendations for clinical reporting systems in remote rehabilitation settings. More broadly, this work contributes a replicable methodology combining expert elicitation, taxonomy-driven generation, and multi-dimensional human evaluation for clinical NLG in low-resource settings, and illustrates how controlled comparisons can inform the responsible adoption of generative AI in healthcare.
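The remark that no comparison survived correction is easier to interpret with a concrete picture of what a multiple-comparison correction does across nine criteria. The abstract does not name the correction procedure, so the sketch below assumes the Holm step-down method purely for illustration; the function name `holm_adjust` and the nine p-values are hypothetical and are not results from the study.

```python
# Illustrative sketch only: Holm step-down adjustment of p-values.
# ASSUMPTION: the abstract does not say which correction was applied;
# Holm is used here as one common choice.

def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# HYPOTHETICAL raw p-values, one per questionnaire criterion (nine in total).
# These numbers are invented for the example and do not come from the paper.
raw_p = [0.04, 0.20, 0.03, 0.50, 0.12, 0.08, 0.60, 0.35, 0.02]
adjusted_p = holm_adjust(raw_p)

alpha = 0.05
significant_before = [p < alpha for p in raw_p]
significant_after = [p < alpha for p in adjusted_p]
```

With these invented inputs, a few criteria fall below 0.05 before adjustment and none do afterwards, because the smallest p-value is multiplied by the number of criteria. This is the general mechanism by which a small evaluator panel rating many criteria can show consistent directional differences without any single comparison remaining significant.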