🤖 AI Summary
Clinical LLM agents show an imbalance in behavioral adaptation: they respond well to explicit queries (e.g., diagnostic reasoning) but intervene poorly on their own initiative (e.g., identifying unmentioned critical information gaps or risks). To address this, the authors propose "behavioral tokens": explicit, structured markers that condition a model to select a behavior along a clinical proactivity spectrum, so that it balances initiative against restraint. The approach combines behavioral-token-conditioned supervised fine-tuning (BehaviorSFT), an evaluation benchmark (BehaviorBench), and a blinded clinician assessment. BehaviorSFT reaches up to 97.3% overall Macro F1 on BehaviorBench and raises the proactive task score of Qwen2.5-7B-Ins from 95.0% to 96.5%. Blinded clinician evaluation finds that BehaviorSFT-trained agents behave more realistically than standard fine-tuned or explicitly instructed agents. The work frames behavioral controllability as a structured token learning problem for behavior alignment in clinical LLM agents.
📝 Abstract
Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnostic reasoning), LLMs often struggle with proactive engagement, such as unprompted identification of critical missing information or risks. We introduce BehaviorBench, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Experiments on BehaviorBench reveal that LLMs are inconsistently proactive. To address this, we propose BehaviorSFT, a novel training strategy that uses behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from 95.0% to 96.5% for Qwen2.5-7B-Ins). Crucially, blinded clinician evaluations confirmed that BehaviorSFT-trained agents exhibit more realistic clinical behavior, striking a better balance between helpful proactivity (e.g., timely, relevant suggestions) and necessary restraint (e.g., avoiding over-intervention) than standard fine-tuned or explicitly instructed agents.
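The headline figure above is a Macro F1 score, which is the unweighted mean of per-class F1 values, so a rare behavior class counts as much as a common one. The following is a minimal sketch of that metric only, not the authors' evaluation code. The label names (`reactive`, `proactive`) and the toy predictions are hypothetical, chosen for illustration, and do not come from BehaviorBench.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical behavior labels, for illustration only
y_true = ["reactive", "reactive", "proactive", "proactive"]
y_pred = ["reactive", "proactive", "proactive", "proactive"]
print(round(macro_f1(y_true, y_pred), 4))
```

In this toy case the `reactive` class has F1 = 2/3 and the `proactive` class has F1 = 4/5, so the macro average is 11/15. Because each class contributes equally, a model that is strong on reactive tasks but weak on proactive ones is penalized more than plain accuracy would suggest.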