ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine

📅 2025-08-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) exhibit limited applicability in Traditional Chinese Medicine (TCM) due to the scarcity of high-quality, domain-specific data and the inherently multimodal nature of TCM diagnostic methods—namely inspection (e.g., tongue and facial imaging), auscultation and olfaction (e.g., speech and odor-derived signal features), inquiry (e.g., spoken interviews), and palpation (e.g., pulse waveforms)—which exceeds the capabilities of unimodal text-based models. Method: We introduce ShizhenGPT, the first multimodal LLM tailored for TCM, trained on the largest curated TCM multimodal dataset to date, encompassing textual records, tongue/facial images, pulse waveforms, spoken consultation audio, and olfactory features (converted via signal transduction). The approach combines pretraining and instruction tuning over unified cross-modal representations to support end-to-end multisensory reasoning. Contribution/Results: ShizhenGPT achieves state-of-the-art performance on TCM licensure examination questions, herbal identification, and visual benchmarks for tongue and facial diagnosis—outperforming comparable-scale baselines and competing with larger proprietary models—thereby advancing multimodal understanding for intelligent TCM diagnosis.

📝 Abstract
Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.
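The abstract describes a common multimodal-LLM recipe: modality-specific encoders (for tongue/facial images, audio, and pulse signals) whose outputs are mapped into the language model's embedding space, followed by pretraining and instruction tuning. As a hypothetical illustration of that general pattern (not ShizhenGPT's actual architecture, whose details the summary does not give), the sketch below projects toy encoder features from two non-text modalities into a shared embedding space and prepends them as "soft tokens" to the text input; all dimensions and names are invented for illustration.

```python
import numpy as np

# Hypothetical MLLM fusion sketch: each modality's encoder output is linearly
# projected into the LLM's hidden dimension and prepended to the text tokens.
# All sizes are toy values, not taken from the paper.

rng = np.random.default_rng(0)
D_LLM = 64  # hypothetical LLM hidden size


def project(features: np.ndarray, d_llm: int, seed: int) -> np.ndarray:
    """Linearly project modality features into the LLM embedding space."""
    r = np.random.default_rng(seed)
    w = r.standard_normal((features.shape[-1], d_llm)) / np.sqrt(features.shape[-1])
    return features @ w


# Toy encoder outputs, shaped (num_patches_or_frames, feature_dim).
tongue_image = rng.standard_normal((16, 32))  # e.g. image-patch features
pulse_signal = rng.standard_normal((8, 12))   # e.g. waveform-frame features
text_tokens = rng.standard_normal((10, D_LLM))  # already in LLM space

# Non-text modalities become "soft tokens" prefixed to the text sequence.
multimodal_prefix = np.concatenate(
    [project(tongue_image, D_LLM, 1), project(pulse_signal, D_LLM, 2)], axis=0
)
llm_input = np.concatenate([multimodal_prefix, text_tokens], axis=0)
print(llm_input.shape)  # (34, 64): 16 image + 8 pulse soft tokens + 10 text
```

The design choice this illustrates is that, once every modality lives in the same embedding space, the language model can attend over images, pulse waveforms, and text jointly, which is what enables the "unified perception across modalities" the abstract claims.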
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of high-quality Traditional Chinese Medicine data
Handling the inherently multimodal nature of TCM diagnostics, which goes beyond text
Enabling unified perception across sound, pulse, smell and vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLM tailored for Traditional Chinese Medicine
Largest TCM dataset with text, images, audio, signals
Unified perception across sound, pulse, smell, vision