🤖 AI Summary
Current large language models lack validation mechanisms for treatment planning, often generating coarse, incomplete, or unsafe recommendations. To address this limitation, this work proposes the first self-iterative agent framework tailored for therapeutic planning, which emulates expert revision through a closed-loop reasoning process of generation, evaluation, and optimization. The framework incorporates TheraJudge, a domain-specific evaluation module that dynamically enforces clinical standards during planning. Experimental results demonstrate that the proposed method achieves state-of-the-art accuracy and completeness on HealthBench. In expert evaluations, it outperforms human physicians with an 86% win rate, exhibiting particular strengths in treatment specificity and harm mitigation. Moreover, TheraJudge’s assessments show strong alignment with established clinical benchmarks.
📝 Abstract
Formulating a treatment plan is inherently a complex reasoning and refinement task rather than a simple generation problem. However, existing large language models (LLMs) mainly rely on one-shot output without explicit verification, which can yield coarse, incomplete, and potentially unsafe treatment plans. To address these limitations, we propose TheraAgent, an agentic framework that replaces one-shot generation with an iterative generate-judge-refine pipeline. By mirroring the reasoning process of human experts, who iteratively revise treatment plans, our framework progressively transforms coarse and incomplete drafts into precise, comprehensive, and safer therapeutic regimens. To support the critical judge component, we introduce TheraJudge, a treatment-specific evaluation module integrated into the inference loop to enforce clinical standards. Experiments show that TheraAgent achieves state-of-the-art results on HealthBench, leading in Accuracy and Completeness. In expert evaluations, it attains an 86% win rate against physicians, with superior Targeting and Harm Control. Moreover, the high agreement between TheraJudge and HealthBench evaluations supports the reliability of our framework.
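The generate-judge-refine loop described in the abstract can be pictured with a minimal Python sketch. This is an illustration of the control flow only, not the authors' implementation: the abstract does not give TheraJudge's scoring scheme, the stopping rule, or any prompts, so `Verdict`, `generate_judge_refine`, the `threshold` and `max_rounds` parameters, and the toy `toy_generate`, `toy_judge`, and `toy_refine` stand-ins are all hypothetical names chosen for this example. In a real system each stand-in would presumably wrap a model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    """Hypothetical judge output: a score in [0, 1] plus a list of open issues."""
    score: float
    issues: List[str]


def generate_judge_refine(
    case: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], Verdict],
    refine: Callable[[str, str, List[str]], str],
    threshold: float = 0.9,   # assumed stopping score, not from the paper
    max_rounds: int = 3,      # assumed iteration budget, not from the paper
) -> Tuple[str, List[Verdict]]:
    """Draft a plan, then alternate judging and refining until it passes."""
    plan = generate(case)
    history: List[Verdict] = []
    for _ in range(max_rounds):
        verdict = judge(case, plan)
        history.append(verdict)
        # Stop once the judge is satisfied or has nothing left to flag.
        if verdict.score >= threshold or not verdict.issues:
            break
        plan = refine(case, plan, verdict.issues)
    return plan, history


# Toy stand-ins (assumptions for illustration; real components would call models).
REQUIRED = ("dose", "follow-up", "contraindications")


def toy_generate(case: str) -> str:
    return f"Draft plan for {case}."


def toy_judge(case: str, plan: str) -> Verdict:
    # Flag every required element that the plan does not mention yet.
    missing = [item for item in REQUIRED if item not in plan.lower()]
    return Verdict(score=1 - len(missing) / len(REQUIRED), issues=missing)


def toy_refine(case: str, plan: str, issues: List[str]) -> str:
    # Address one flagged issue per round so the loop visibly iterates.
    return f"{plan} Specify {issues[0]}."


final_plan, verdicts = generate_judge_refine(
    "example case", toy_generate, toy_judge, toy_refine, max_rounds=5
)
print(final_plan)
print([round(v.score, 2) for v in verdicts])
```

Keeping the judge as a separate callable that returns both a score and concrete issues is one way to let the refiner act on specific feedback rather than regenerate from scratch, which matches the abstract's description of a judge integrated into the inference loop. Note that in this sketch a plan refined in the final round is returned without being judged again.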