Towards Conversational Medical AI with Eyes, Ears and a Voice

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a limitation of existing medical dialogue AI systems: because they rely solely on text, they miss the audiovisual cues present in real clinical encounters and cannot offer real-time multimodal decision support. The authors propose AI co-clinician, which they describe as a first-of-its-kind conversational system that uses continuous audio-visual streams from live patient conversations to inform real-time clinical decisions. Built on Gemini's low-latency voice and video processing, its dual-agent architecture balances deep clinical reasoning with the low latency needed for natural dialogue. To evaluate it, the work introduces a video-based interface emulating telemedicine consultations, 20 standardized outpatient scenarios, and the TelePACES evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover simulation study of 120 encounters, the system approached primary care physicians on key TelePACES dimensions such as management plans and differential diagnosis and significantly outperformed GPT-Realtime across all general criteria. Physicians retained superior overall performance on case-specific assessments, and gaps remain in physical examination and disease-specific reasoning.

📝 Abstract

The practice of medicine relies not only upon skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Building on the low-latency voice and video processing capabilities of Gemini, we introduce AI co-clinician, a first-of-its-kind conversational AI system utilizing continuous streams of audio-visual data from live patient conversations to inform real-time clinical decisions. Its dual-agent architecture balances deep clinical reasoning with the low latency required for natural dialogue. To assess this system, we implemented a video-based interface emulating telemedicine consultations. We crafted 20 standardized outpatient scenarios requiring proactive real-time auditory and visual reasoning and designed "TelePACES" evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover simulation study (n = 120 encounters) with 10 internal medicine residents as patient actors, we compared AI co-clinician with primary care physicians (PCPs), GPT-Realtime, and a baseline agent. AI co-clinician approached PCPs in key TelePACES dimensions, including management plans and differential diagnosis, while significantly outperforming GPT-Realtime across all general criteria. While our agent demonstrated parity with PCPs in case-specific triage measures, physicians maintained superior overall performance in case-specific assessments. Although AI co-clinician marks a significant advance in real-time telemedical AI, gaps remain in physical examination and disease-specific reasoning. Our work shows that text-only approaches fail to capture the true challenges of medical consultation and suggests that high-stakes real-time diagnostic AI is most safely advanced in collaborative, triadic models where AI can be a supportive co-clinician for doctors and patients.
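
The abstract says the dual-agent architecture balances deep clinical reasoning with low-latency dialogue, but gives no implementation details. The sketch below illustrates one generic way such a split could work and is not the authors' design: a slow "reasoner" task posts guidance to a queue while a fast "talker" answers each turn immediately, picking up whatever guidance has arrived so far. All names (`slow_reasoner`, `fast_talker`, `run_consultation`), the sleep durations, and the string-based messages are hypothetical stand-ins; no Gemini API is called.

```python
import asyncio


async def slow_reasoner(observations: list[str], notes: asyncio.Queue) -> None:
    """Hypothetical deep-reasoning agent: slow, posts guidance when ready."""
    for obs in observations:
        await asyncio.sleep(0.05)  # stand-in for expensive clinical reasoning
        await notes.put(f"consider follow-up on: {obs}")


async def fast_talker(turns: list[str], notes: asyncio.Queue) -> list[dict]:
    """Hypothetical low-latency dialogue agent: never waits for the reasoner."""
    replies = []
    for turn in turns:
        await asyncio.sleep(0.01)  # stand-in for a quick conversational reply
        guidance = []
        # Drain only the guidance that is already available; do not block.
        while not notes.empty():
            guidance.append(notes.get_nowait())
        replies.append({"turn": turn, "guidance": guidance})
    return replies


async def run_consultation(turns: list[str], observations: list[str]):
    """Run both agents concurrently; return replies and unused guidance."""
    notes: asyncio.Queue = asyncio.Queue()
    reasoner = asyncio.create_task(slow_reasoner(observations, notes))
    replies = await fast_talker(turns, notes)
    await reasoner  # let the reasoner finish so no guidance is lost
    leftover = []
    while not notes.empty():
        leftover.append(notes.get_nowait())
    return replies, leftover


if __name__ == "__main__":
    replies, leftover = asyncio.run(
        run_consultation(
            ["I have a cough", "It started last week", "No fever"],
            ["audible wheeze", "visible fatigue"],
        )
    )
    for reply in replies:
        print(reply)
    print("guidance that arrived after the last turn:", leftover)
```

The point of the design choice is that the dialogue loop's latency is independent of how long the reasoning step takes: guidance that arrives late is applied to a later turn or left over, rather than stalling the conversation.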

Problem

Research questions and friction points this paper is trying to address.

conversational medical AI

audio-visual cues

real-time clinical decision

telemedicine

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal medical AI

real-time clinical reasoning

audio-visual dialogue system