🤖 AI Summary
Current large language models (LLMs) exhibit limited safety and adaptability in real-world clinical settings due to insufficient contextual awareness—particularly in identifying critical patient-specific information such as identity, medical history, and risk factors. To address this, we propose MuSeR, a multi-dimensional self-refinement framework that combines attribute-conditioned query generation with knowledge distillation, enabling self-evaluation and self-refinement across three core facets: clinical decision-making, patient communication, and safety assurance. Leveraging supervised fine-tuning on diverse, realistic medical scenario data, MuSeR significantly enhances the performance of compact models. On the HealthBench benchmark, our approach achieves the highest overall score (63.8%) among open-source models, with 43.1% on its hard subset. Notably, the lightweight student model even surpasses its teacher, empirically validating the effectiveness and generalizability of context-aware enhancement.
📝 Abstract
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design an attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then responds to these queries, self-evaluates its answers along the three facets, and refines its responses to better align with the requirements of each facet. Finally, the queries and refined responses are used for supervised fine-tuning to reinforce the model's context-awareness. Evaluation results on the recent HealthBench dataset demonstrate that our method significantly improves LLM performance across multiple aspects, with particularly notable gains on the context-awareness axis. Furthermore, by combining the proposed method with knowledge distillation, a smaller backbone LLM (e.g., Qwen3-32B) surpasses its teacher model, achieving a new SOTA among open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%). Code and dataset will be released at https://muser-llm.github.io.
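The pipeline described above (attribute-conditioned query generation, response, per-facet self-evaluation and refinement, then SFT data collection) can be sketched in code. This is a minimal control-flow illustration, not the authors' implementation: `call_llm` is a mock stand-in for a real LLM API, and the specific attribute values and prompt templates are hypothetical.

```python
import random

FACETS = ["decision-making", "communication", "safety"]

# Hypothetical attribute space; the paper varies role, region,
# intent, and information ambiguity, but these values are illustrative.
ATTRIBUTES = {
    "role": ["patient", "nurse", "physician"],
    "region": ["North America", "Europe", "South Asia"],
    "intent": ["triage", "medication question", "symptom check"],
    "ambiguity": ["low", "high"],
}

def call_llm(prompt: str) -> str:
    """Mock LLM call so the pipeline runs end to end without an API."""
    return f"[LLM output for: {prompt[:50]}...]"

def generate_query(attrs: dict) -> str:
    """Step 1: attribute-conditioned query generation."""
    return call_llm(
        f"Write a realistic medical query from a {attrs['role']} in "
        f"{attrs['region']} with {attrs['intent']} intent and "
        f"{attrs['ambiguity']} information ambiguity."
    )

def self_refine(query: str) -> str:
    """Steps 2-3: answer, then critique and refine along each facet."""
    answer = call_llm(f"Answer the medical query: {query}")
    for facet in FACETS:
        critique = call_llm(f"Evaluate this answer on '{facet}':\n{answer}")
        answer = call_llm(f"Refine the answer given this {facet} "
                          f"critique: {critique}")
    return answer

def build_sft_dataset(n_samples: int = 4, seed: int = 0) -> list:
    """Step 4: collect (query, refined response) pairs for fine-tuning."""
    random.seed(seed)
    data = []
    for _ in range(n_samples):
        attrs = {k: random.choice(v) for k, v in ATTRIBUTES.items()}
        query = generate_query(attrs)
        data.append({"query": query, "response": self_refine(query)})
    return data

dataset = build_sft_dataset()
print(len(dataset))  # number of (query, response) pairs ready for SFT
```

With a real LLM behind `call_llm`, the resulting pairs would be fed to a standard supervised fine-tuning loop on the student model.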