🤖 AI Summary
Medical robots operating in dynamic clinical environments require robust temporal reasoning, uncertainty quantification, and structured decision-making. Method: This paper proposes a lightweight multimodal collaborative reasoning framework built on SmolAgent that integrates vision and speech modalities. It incorporates structured scene graph generation, hybrid retrieval-augmented reasoning, and dynamic tool invocation to enable chain-of-thought inference and interpretable outputs. Qwen2.5-VL-3B-Instruct serves as the multimodal foundation model, with a dedicated orchestration layer ensuring cross-modal alignment and robust inference. Contribution/Results: Evaluated on the Video-MME benchmark and a custom clinical dataset, the framework achieves accuracy competitive with state-of-the-art VLMs while improving robustness and environmental adaptability in surgical assistance and patient monitoring tasks, pointing toward safer, more trustworthy perception and decision-making in medical robotics.
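For concreteness, the sketch below shows one way such an orchestration layer could be wired with the Hugging Face smolagents library and a Qwen2.5-VL backbone. The tool name (`detect_instruments`), its return format, and the frame path are illustrative assumptions, not the paper's implementation, and whether `TransformersModel` loads this VLM directly may depend on the library version.

```python
from smolagents import CodeAgent, TransformersModel, tool

@tool
def detect_instruments(frame_path: str) -> str:
    """Hypothetical perception tool: report surgical instruments visible in a video frame.

    Args:
        frame_path: Path to an extracted video frame on disk.
    """
    # Placeholder body; a real system would run a detector and return structured JSON.
    return '{"instruments": ["scalpel", "forceps"], "confidences": [0.91, 0.84]}'

# Qwen2.5-VL-3B-Instruct acts as the multimodal backbone; the agent layer
# plans step by step and decides when to invoke registered tools.
model = TransformersModel(model_id="Qwen/Qwen2.5-VL-3B-Instruct")
agent = CodeAgent(tools=[detect_instruments], model=model)

answer = agent.run(
    "Inspect the frame at /tmp/frame_0042.png and state whether the instrument tray is complete."
)
print(answer)
```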
📝 Abstract
Healthcare robotics requires robust multimodal perception and reasoning to ensure safety in dynamic clinical environments. Current Vision-Language Models (VLMs) demonstrate strong general-purpose capabilities but remain limited in temporal reasoning, uncertainty estimation, and structured outputs needed for robotic planning. We present a lightweight agentic multimodal framework for video-based scene understanding. Combining the Qwen2.5-VL-3B-Instruct model with a SmolAgent-based orchestration layer, it supports chain-of-thought reasoning, speech-vision fusion, and dynamic tool invocation. The framework generates structured scene graphs and leverages a hybrid retrieval module for interpretable and adaptive reasoning. Evaluations on the Video-MME benchmark and a custom clinical dataset show competitive accuracy and improved robustness compared to state-of-the-art VLMs, demonstrating its potential for applications in robot-assisted surgery, patient monitoring, and decision support.
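As a rough illustration of the structured scene graphs both sections describe, a downstream planner could consume a typed node-and-relation representation like the one below. The field names and example values are hypothetical; the paper's actual schema is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """An entity detected in the clinical scene (patient, clinician, instrument, ...)."""
    node_id: str
    label: str
    confidence: float  # per-entity uncertainty estimate for downstream safety checks

@dataclass
class SceneRelation:
    """A directed relation between two nodes, e.g. 'holding' or 'near'."""
    subject: str
    predicate: str
    obj: str

@dataclass
class SceneGraph:
    """Structured scene description handed to the planning and reasoning layer."""
    timestamp: float
    nodes: list[SceneNode] = field(default_factory=list)
    relations: list[SceneRelation] = field(default_factory=list)

# Hypothetical frame: a clinician holding forceps near the patient.
graph = SceneGraph(
    timestamp=12.4,
    nodes=[
        SceneNode("n1", "clinician", 0.97),
        SceneNode("n2", "forceps", 0.88),
        SceneNode("n3", "patient", 0.99),
    ],
    relations=[
        SceneRelation("n1", "holding", "n2"),
        SceneRelation("n2", "near", "n3"),
    ],
)
print(len(graph.nodes), "nodes,", len(graph.relations), "relations")
```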