AI Summary
This study addresses dietary sodium counseling for heart failure patients, a high-stakes clinical task requiring accuracy, interpretability, and reliability. Method: We conducted the first controlled, task-oriented comparison between a neurosymbolic dialogue assistant and a generative large language model (ChatGPT) in real-world clinical settings. The neurosymbolic system integrates a rule engine, an embedded clinical knowledge graph, and a fine-tuned lightweight language model, augmented with speech interaction and a curated dietary knowledge base; the ChatGPT API served as the baseline. Results: The neurosymbolic system achieved significantly higher accuracy and task completion rate (+23%), produced more concise responses, and ensured greater controllability and transparency. While ChatGPT exhibited marginally fewer speech recognition errors and required fewer clarifications, patient preference showed no statistically significant difference. Contribution: We propose a lightweight, controllable neurosymbolic dialogue paradigm tailored for health counseling and empirically demonstrate its superiority over purely generative approaches in safety-critical, reliability-demanding medical Q&A.
Abstract
Conversational assistants are becoming increasingly popular, including in healthcare, partly because of the availability and capabilities of Large Language Models. There is a need for controlled, probing evaluations with real stakeholders that can highlight the advantages and disadvantages of more traditional architectures and of those based on generative AI. We present a within-group user study comparing two versions of a conversational assistant that allows heart failure patients to ask about the salt content of food. One version was developed in-house with a neurosymbolic architecture, and the other is based on ChatGPT. The evaluation shows that the in-house system is more accurate, completes more tasks, and is less verbose than the ChatGPT-based one; on the other hand, the ChatGPT-based system makes fewer speech errors and requires fewer clarifications to complete the task. Patients show no preference for one over the other.