Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning

📅 2026-02-06
🤖 AI Summary
This study addresses the limitations of lightweight large language models in psychiatric clinical reasoning, where hallucinations and shallow reasoning leave their outputs misaligned with expert diagnostic logic. The authors propose ClinMPO, a framework that integrates evidence-based medicine into a reinforcement learning reward mechanism. Specifically, they train a reward model on a dataset derived from 4,474 psychiatry journal articles and use it to guide the internal reasoning of the Qwen3-8B model toward clinical cognition. Unlike conventional training paradigms that prioritize linguistic fluency, ClinMPO targets structured clinical logic: on an unseen set of complex psychiatric cases that leading large-parameter LLMs consistently fail, the tuned model reaches a diagnostic accuracy of 31.4%, slightly above the 30.8% human benchmark from a cohort of 300 medical students.

📝 Abstract
Large language models (LLMs) hold transformative potential for medical decision support, yet their application in psychiatry remains constrained by hallucinations and superficial reasoning. This limitation is particularly acute in light-parameter LLMs, which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic, resulting in a fundamental misalignment with professional diagnostic cognition. Here we introduce ClinMPO, a reinforcement learning framework designed to align the internal reasoning of LLMs with professional psychiatric practice. The framework employs a specialized reward model trained independently on a dataset derived from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. We evaluated ClinMPO on an unseen subset of the benchmark designed to isolate reasoning capabilities from rote memorization. This test set comprises items on which leading large-parameter LLMs consistently fail. We compared the performance of the ClinMPO-aligned light LLM against a cohort of 300 medical students. The ClinMPO-tuned Qwen3-8B model achieved a diagnostic accuracy of 31.4% and surpassed the human benchmark of 30.8% on these complex cases. These results demonstrate that medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks. Our findings suggest that explicit cognitive alignment offers a scalable pathway to reliable and safe psychiatric decision support.
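The abstract does not specify the form of the reward model or the optimization algorithm, so the following is only a toy illustration of the general idea: a separate reward function scores candidate reasoning traces, and the scores are turned into relative preferences within a group of candidates. Everything here is an assumption made for illustration, including the keyword list `EVIDENCE_CRITERIA`, the functions `toy_reward` and `relative_advantages`, and the choice to normalize rewards within a group. ClinMPO's actual reward model is a trained model, not a keyword matcher.

```python
from statistics import mean, pstdev

# Hypothetical criteria standing in for a learned, evidence-based reward model.
EVIDENCE_CRITERIA = (
    "symptom duration",
    "differential diagnosis",
    "functional impairment",
)


def toy_reward(reasoning: str) -> float:
    """Return the fraction of hypothetical criteria mentioned in a trace."""
    text = reasoning.lower()
    hits = sum(1 for criterion in EVIDENCE_CRITERIA if criterion in text)
    return hits / len(EVIDENCE_CRITERIA)


def relative_advantages(rewards: list[float]) -> list[float]:
    """Center and scale rewards within one group of sampled candidates.

    This is an assumed normalization scheme, not one stated in the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All candidates scored the same, so none is preferred.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Two made-up candidate reasoning traces for the same case.
candidates = [
    "Low mood, so this is depression.",
    "Symptom duration and functional impairment are documented; "
    "the differential diagnosis includes bipolar disorder.",
]
rewards = [toy_reward(c) for c in candidates]
advantages = relative_advantages(rewards)
```

In this sketch the second trace mentions all three hypothetical criteria and the first mentions none, so the second receives the higher reward and a positive relative advantage. A policy update would then make traces like the second one more likely.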
Problem

Research questions and friction points this paper is trying to address.

large language models
psychiatric clinical reasoning
hallucinations
light-parameter LLMs
cognitive alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

evidence-guided reinforcement learning
cognitive alignment
light-parameter LLMs
psychiatric clinical reasoning
reward modeling
Xinxin Lin
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
Guangxin Dai
Shandong University, Jinan, China
Yi Zhong
Beijing University of Posts and Telecommunications
Computer Vision, Pattern Recognition, Signal Processing
Xiang Li
Shandong University, Jinan, China
Xue Xiao
Inspur Cloud Information Technology Co., Ltd., Jinan, China
Yixin Zhang
Fudan University, Shanghai, China
Zhengdong Wu
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
Yongbo Zheng
Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, China
Runchuan Zhu
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
Ming Zhao
Jilin University
Huizi Yu
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
Shuo Wu
Shandong University, Jinan, China
Jun Zhao
Shandong Normal University
Cyber Security, Cyber Threat Intelligence, Heterogeneous Graph Neural Networks, Data Mining
Lingming Hu
Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, China
Yumei Wang
Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, China
Ping Yin
Hong Kong ICI Cloud Service Limited, Hong Kong SAR, China
Joey W. Y. Chan
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
Ngan Yin Chan
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
Sijing Chen
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
Yun Kwok Wing
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
Lin Lu
PhD student, Nankai University
Conformal inference, Multiple testing
Xin Ma
Shandong University, Jinan, China
Lizhou Fan
Vice-Chancellor Assistant Professor, The Chinese University of Hong Kong
Medical AI, Health Informatics, AI Agents, AI for Science, Psychiatry