AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the domain-specific reasoning capabilities of large language models (LLMs) in anesthesiology—a critical yet underexplored medical specialty. Method: We introduce AnesBench, the first cross-lingual, multi-level benchmark for anesthesiology reasoning, encompassing factual retrieval, hybrid reasoning, and complex clinical decision-making tasks. We propose a domain-specific, multi-tiered evaluation framework and a novel System 1.x hybrid reasoning paradigm. Additionally, we publicly release high-quality bilingual (Chinese–English) datasets alongside resources for continual pretraining (CPT) and supervised fine-tuning (SFT). Contribution/Results: Empirical analysis reveals a nonlinear relationship between model scale and chain-of-thought length; CPT+SFT significantly improves domain accuracy; and inference strategies—including Best-of-N sampling and beam search—enhance decision robustness. Collectively, this work establishes a methodological foundation and empirical evidence base for evaluating and optimizing medical LLMs.

📝 Abstract
The application of large language models (LLMs) in the medical field has gained significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. In this paper, we systematically evaluate the reasoning capabilities of LLMs in anesthesiology and analyze key factors influencing their performance. To this end, we introduce AnesBench, a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Through extensive experiments, we first explore how model characteristics, including model scale, Chain of Thought (CoT) length, and language transferability, affect reasoning performance. We then evaluate the effectiveness of different training strategies, including continuous pre-training (CPT) and supervised fine-tuning (SFT), leveraging our curated anesthesiology-related dataset. Additionally, we investigate how test-time reasoning techniques, such as Best-of-N sampling and beam search, influence reasoning performance, and assess the impact of reasoning-enhanced model distillation, specifically DeepSeek-R1. We will publicly release AnesBench, along with our CPT and SFT training datasets and evaluation code, at https://github.com/MiliLab/AnesBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLM reasoning in anesthesiology across three levels
Analyze factors affecting LLM performance in anesthesiology
Assess training strategies and reasoning techniques for LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual benchmark for anesthesiology reasoning evaluation
Training strategies: continuous pre-training and fine-tuning
Test-time reasoning techniques: Best-of-N sampling and beam search
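The Best-of-N idea above can be sketched in a few lines: draw N candidate answers from a sampler and keep the one a scoring function ranks highest. This is a minimal illustration, not the paper's implementation; `toy_generate` and `toy_score` are hypothetical stand-ins for an LLM sampler and a verifier/reward model.

```python
import random

def best_of_n(generate, score, n=8, seed=0):
    """Best-of-N sampling: draw n candidates and keep the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-ins for an LLM sampler and a verifier/reward model.
def toy_generate(rng):
    # Pretend each sample is a multiple-choice answer with some quality score.
    return {"answer": rng.choice(["A", "B", "C", "D"]), "quality": rng.random()}

def toy_score(candidate):
    return candidate["quality"]

best = best_of_n(toy_generate, toy_score, n=16)
print(best["answer"])
```

Beam search differs in that it keeps the top-k partial sequences at each decoding step rather than scoring only complete answers.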
Xiang Feng
ShanghaiTech University
Neural Radiance Fields, Image Super Resolution, Computer Vision
Wentao Jiang
School of Computer Science, Wuhan University, China
Zengmao Wang
Associate Professor, School of Computer Science, Wuhan University
Artificial Intelligence, Machine Learning, Remote Sensing
Yong Luo
Wuhan University
Artificial Intelligence, Machine Learning, Data Mining, Pattern Classification and Search
Pingbo Xu
Department of Anesthesiology, Zhejiang Cancer Hospital, China; Institute of Medicine, Chinese Academy of Sciences, Hangzhou, Zhejiang, China
Baosheng Yu
Assistant Professor, Nanyang Technological University
Machine Learning, Deep Learning, Computer Vision, AI for Medicine
Hua Jin
Department of Anesthesiology, First People’s Hospital of Yunnan Province, China; Kunming University of Science and Technology, China
Bo Du
Department of Management, Griffith Business School
Sustainable Transport, Travel Behaviour, Urban Data Analytics, Logistics and Supply Chain
Jing Zhang
School of Computer Science, Wuhan University, China