🤖 AI Summary
This work proposes Meissa, a lightweight 4-billion-parameter multimodal medical large language model designed to overcome the limitations of existing cloud-dependent medical agents, which suffer from high cost, high latency, and privacy risks that hinder on-premise deployment. Meissa leverages unified trajectory modeling, three-tier stratified supervision, and a novel prospective-retrospective supervision mechanism to distill knowledge from 40,000 curated medical interaction trajectories, enabling the model to autonomously select among direct reasoning, tool invocation, and multi-agent collaboration based on task complexity. Evaluated across 16 settings spanning 13 medical benchmarks, Meissa matches or exceeds the performance of leading API-driven agents in 10 of them, while using over 25× fewer parameters, reducing end-to-end latency by 22×, and supporting fully offline operation.
📝 Abstract
Multi-modal large language models (MM-LLMs) have shown strong performance in medical image understanding and clinical reasoning. Recent medical agent systems extend them with tool use and multi-agent collaboration, enabling complex decision-making. However, these systems rely almost entirely on frontier models (e.g., GPT), whose API-based deployment incurs high cost, high latency, and privacy risks that conflict with on-premise clinical requirements. We present Meissa, a lightweight 4B-parameter medical MM-LLM that brings agentic capability offline. Instead of imitating static answers, Meissa learns both when to engage external interaction (strategy selection) and how to execute multi-step interaction (strategy execution) by distilling structured trajectories from frontier models. Specifically, we propose: (1) Unified trajectory modeling: trajectories (reasoning and action traces) are represented within a single state-action-observation formalism, allowing one model to generalize across heterogeneous medical environments. (2) Three-tier stratified supervision: the model's own errors trigger progressive escalation from direct reasoning to tool-augmented and multi-agent interaction, explicitly learning difficulty-aware strategy selection. (3) Prospective-retrospective supervision: pairing exploratory forward traces with hindsight-rationalized execution traces enables stable learning of effective interaction policies. Trained on 40K curated trajectories, Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. Using over 25× fewer parameters than typical frontier models like Gemini-3, Meissa operates fully offline with 22× lower end-to-end latency compared to API-based deployment. Data, models, and environments are released at https://github.com/Schuture/Meissa.
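The single state-action-observation formalism described in the abstract can be sketched as a simple data structure in which reasoning steps, tool calls, and multi-agent exchanges all share one record shape. This is a hypothetical illustration only: the field names and action labels below are assumptions for clarity, not the paper's actual trajectory schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    """One transition in a trajectory (names are illustrative, not from the paper)."""
    state: str        # context visible to the model at this step
    action: str       # strategy taken: "reason", "call_tool", or "consult_agent"
    content: str      # emitted reasoning text, tool call, or message
    observation: str  # environment feedback ("" for pure reasoning steps)

@dataclass
class Trajectory:
    """A full interaction trace under one unified formalism."""
    task: str
    steps: List[Step] = field(default_factory=list)

    def add(self, state: str, action: str, content: str, observation: str = "") -> None:
        self.steps.append(Step(state, action, content, observation))

# All three interaction strategies fit the same record shape:
traj = Trajectory(task="Characterize the lesion in this chest X-ray")
traj.add("s0", "reason", "Opacity visible in the left lower lobe ...")
traj.add("s1", "call_tool", "segment(image)", "mask covering 312-px region")
traj.add("s2", "consult_agent", "ask radiology specialist", "concurs: consolidation")

print(len(traj.steps))                       # number of transitions
print({s.action for s in traj.steps})        # strategies used in this trace
```

Representing heterogeneous interactions in one shape like this is what would let a single model be trained on reasoning-only, tool-augmented, and multi-agent traces without per-environment formats.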