🤖 AI Summary
Quantized large language models (e.g., INT4) suffer severe performance degradation on complex biomedical reasoning in resource-constrained clinical settings. Method: This paper proposes QM-ToT (Quantized Medical Tree-of-Thought), the first ToT framework adapted to biomedicine, integrating quantization-aware path decomposition and a multi-level evaluation mechanism; it further introduces a ToT-guided knowledge distillation strategy requiring only 3.9% of the training data. Results: On MedQA-USMLE, QM-ToT raises the accuracy of quantized LLaMA2-70B from 34% to 50% (+16 percentage points) and of LLaMA-3.1-8B from 58.77% to 69.49% (+10.72 percentage points); the ToT-based distillation achieves an 86.27% improvement over traditional distillation while using only 3.9% of the data. The framework significantly enhances the clinical reasoning of quantized models and empirically demonstrates the feasibility of deploying high-accuracy medical LLMs on edge healthcare devices.
📝 Abstract
Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, and their performance degrades further when they are quantized for resource-constrained deployment. To address these issues, we propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This framework yields substantial performance improvements for INT4-quantized models on the challenging MedQA-USMLE dataset. Specifically, we demonstrate a remarkable accuracy increase from 34% to 50% for the LLaMA2-70B model and from 58.77% to 69.49% for LLaMA-3.1-8B. In addition, we propose an effective data distillation method based on ToT; compared to the traditional distillation method, it achieves an 86.27% improvement while using only 3.9% of the data. This work, for the first time, showcases the potential of ToT to significantly enhance performance on complex biomedical tasks, establishing a crucial foundation for future advances in deploying high-performing quantized LLMs in resource-limited medical settings.
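To make the "decompose into subtasks, then score with evaluator layers" idea concrete, here is a minimal beam-style Tree-of-Thought sketch. This is not the paper's implementation: `propose` and `evaluate` are hypothetical stand-ins for LLM calls (a quantized model proposing candidate reasoning steps, and an evaluator scoring partial paths), and the toy versions below are purely illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Path:
    steps: List[str]   # the reasoning steps taken so far (root = the question)
    score: float = 0.0 # evaluator's score for this partial path

def tree_of_thought(
    question: str,
    propose: Callable[[List[str]], List[str]],  # stand-in: LLM proposes next steps
    evaluate: Callable[[List[str]], float],     # stand-in: evaluator scores a path
    depth: int = 3,
    beam_width: int = 2,
) -> Path:
    """Expand candidate reasoning paths level by level; at each level the
    evaluator prunes all but the top `beam_width` paths."""
    beam = [Path(steps=[question])]
    for _ in range(depth):
        candidates = []
        for p in beam:
            for step in propose(p.steps):
                new_steps = p.steps + [step]
                candidates.append(Path(new_steps, evaluate(new_steps)))
        beam = sorted(candidates, key=lambda p: p.score, reverse=True)[:beam_width]
    return max(beam, key=lambda p: p.score)

# Toy stand-ins for demonstration only:
def toy_propose(steps: List[str]) -> List[str]:
    return [f"{steps[-1]}->a", f"{steps[-1]}->b"]

def toy_evaluate(steps: List[str]) -> float:
    # Arbitrary toy scoring: prefer paths containing more 'a' branches.
    return float(sum(s.count("a") for s in steps))

best = tree_of_thought("Q", toy_propose, toy_evaluate)
print(best.steps[-1])  # the deepest step on the best-scoring path
```

The key design point this illustrates is that the evaluator, not the generator, controls which reasoning branches survive, which is what lets a weaker (quantized) proposer still reach strong final answers.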