BehaviorSFT: Behavioral Token Conditioning for Clinical Agents Across the Proactivity Spectrum

📅 2025-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Clinical LLM agents adapt their behavior unevenly: they handle reactive tasks well (e.g., diagnostic reasoning) but are weak at proactive intervention (e.g., identifying unmentioned critical information gaps or risks). To address this, the authors propose “behavioral tokens”, explicit markers that condition the model to select a behavior along a clinical proactivity spectrum, balancing initiative against restraint. The work combines behavioral-token-conditioned supervised fine-tuning (BehaviorSFT), an evaluation dataset spanning reactive to proactive behaviors (BehaviorBench), and a blinded clinician evaluation. BehaviorSFT reaches up to 97.3% overall Macro F1 on BehaviorBench and raises the proactive task score of Qwen2.5-7B-Ins from 95.0% to 96.5%. In the blinded evaluation, clinicians judged BehaviorSFT-trained agents to behave more realistically than agents trained with standard fine-tuning or given explicit instructions.
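For readers unfamiliar with the headline metric, Macro F1 is the unweighted mean of the per-class F1 scores, so a rare behavior class counts as much as a common one. The sketch below is a generic implementation of that definition. The class names `reactive` and `proactive` are illustrative assumptions, not BehaviorBench's actual label set.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative labels only (assumed, not taken from the dataset).
gold = ["reactive", "reactive", "proactive", "proactive"]
pred = ["reactive", "proactive", "proactive", "proactive"]
score = macro_f1(gold, pred)
```

Because each class contributes equally, a model that handles reactive cases well but misses proactive ones is penalized more heavily than plain accuracy would suggest.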

📝 Abstract

Large Language Models (LLMs) as clinical agents require careful behavioral adaptation. While adept at reactive tasks (e.g., diagnostic reasoning), LLMs often struggle with proactive engagement, like unprompted identification of critical missing information or risks. We introduce BehaviorBench, a comprehensive dataset to evaluate agent behaviors across a clinical assistance spectrum, ranging from reactive query responses to proactive interventions (e.g., clarifying ambiguities, flagging overlooked critical data). Our BehaviorBench experiments reveal LLMs' inconsistent proactivity. To address this, we propose BehaviorSFT, a novel training strategy using behavioral tokens to explicitly condition LLMs for dynamic behavioral selection along this spectrum. BehaviorSFT boosts performance, achieving up to 97.3% overall Macro F1 on BehaviorBench and improving proactive task scores (e.g., from 95.0% to 96.5% for Qwen2.5-7B-Ins). Crucially, blinded clinician evaluations confirmed that BehaviorSFT-trained agents exhibit more realistic clinical behavior, striking a better balance between helpful proactivity (e.g., timely, relevant suggestions) and necessary restraint (e.g., avoiding over-intervention) than agents trained with standard fine-tuning or given explicit instructions.
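The abstract does not spell out how behavioral tokens enter the training data, so the following is only a minimal sketch of one plausible reading: each supervised example's target begins with a special token naming the intended behavior, so the model learns to commit to a behavior before writing its reply. The token strings, the `Example` fields, and the choice to place the token in the target rather than the prompt are all assumptions made for illustration, not the paper's specification.

```python
from dataclasses import dataclass

# Hypothetical token vocabulary; the real BehaviorSFT tokens are not given in the source.
BEHAVIOR_TOKENS = {
    "reactive": "<|reactive|>",
    "proactive": "<|proactive|>",
}


@dataclass
class Example:
    query: str      # clinician's message
    behavior: str   # key into BEHAVIOR_TOKENS
    response: str   # reference reply for that behavior


def format_for_sft(ex: Example) -> dict:
    """Build a prompt/target pair whose target starts with the behavior token."""
    token = BEHAVIOR_TOKENS[ex.behavior]
    return {"prompt": ex.query, "target": f"{token} {ex.response}"}


def split_behavior(generated: str):
    """Recover (behavior name, reply text) from text that starts with a behavior token."""
    for name, token in BEHAVIOR_TOKENS.items():
        if generated.startswith(token):
            return name, generated[len(token):].lstrip()
    return None, generated  # no recognized token found


# Placeholder strings, not real clinical content.
pair = format_for_sft(Example("placeholder query", "proactive", "placeholder reply"))
behavior, reply = split_behavior(pair["target"])
```

In a real pipeline the tokens would also have to be registered with the tokenizer as special tokens so each maps to a single ID; that step depends on the training framework and is omitted here.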

Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with proactive clinical engagement.

Need for dynamic behavioral adaptation in clinical agents.

Improving the balance between proactivity and restraint in LLMs.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Behavioral tokens condition LLMs dynamically.

BehaviorBench dataset evaluates clinical agent behaviors.

BehaviorSFT improves proactive and reactive performance.

🔎 Similar Papers

SmartState: An Automated Research Protocol Adherence System