🤖 AI Summary
Medical multimodal large language models (MLLMs) are held back by a scarcity of high-quality, logically annotated video data. Manual annotation is prohibitively expensive, while existing synthetic data generation methods are prone to hallucination and lack logical interpretability. To address these limitations, we propose the first neuro-symbolic framework grounded in knowledge graph traversal, formalizing video reasoning as a deterministic spatiotemporal graph traversal process. Our approach integrates visual primitive extraction, dynamic spatiotemporal knowledge graph construction, and path-driven query generation to enable structured, traceable, multi-hop reasoning task synthesis. The resulting large-scale benchmark, M3-Med-Auto, matches expert-annotated data in task complexity and reasoning depth, while significantly improving logical consistency, verifiability, and alignment with model reasoning capabilities.
📝 Abstract
The scarcity of high-quality, logically annotated video datasets remains a primary bottleneck in advancing Multi-Modal Large Language Models (MLLMs) for the medical domain. Traditional manual annotation is prohibitively expensive and non-scalable, while existing synthetic methods often suffer from stochastic hallucinations and a lack of logical interpretability. To address these challenges, we introduce Med-CRAFT, a novel neuro-symbolic data engineering framework that formalizes benchmark synthesis as a deterministic graph traversal process. Unlike black-box generative approaches, Med-CRAFT extracts structured visual primitives (e.g., surgical instruments, anatomical boundaries) from raw video streams and instantiates them into a dynamic Spatiotemporal Knowledge Graph. By anchoring query generation to valid paths within this graph, we enforce a rigorous Chain-of-Thought (CoT) provenance for every synthesized benchmark item. We instantiate this pipeline to produce M3-Med-Auto, a large-scale medical video reasoning benchmark exhibiting fine-grained temporal selectivity and multi-hop logical complexity. Comprehensive evaluations demonstrate that our automated pipeline generates query workloads with complexity comparable to expert-curated datasets. Furthermore, a logic alignment analysis reveals a high correlation between the prescribed graph topology and the reasoning steps of state-of-the-art MLLMs, validating the system's capability to encode verifiable logic into visual-linguistic benchmarks. This work paves the way for scalable, low-cost construction of robust evaluation protocols in critical domains.
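The path-anchored generation described in the abstract can be illustrated with a toy sketch. This is not the Med-CRAFT implementation: the `Edge` schema, the function names (`build_graph`, `enumerate_paths`, `path_to_item`), the question template, and the example edges are all hypothetical, and the visual primitive extraction step is assumed to have already produced timestamped relations. The sketch only shows how a benchmark item might be derived from a valid multi-hop path so that each reasoning step traces back to a graph edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One timestamped relation between two visual primitives (hypothetical schema)."""
    src: str        # source entity, e.g. a surgical instrument
    relation: str   # spatial or temporal relation label
    dst: str        # target entity, e.g. an anatomical structure
    t_start: float  # seconds from the start of the video
    t_end: float


# Illustrative edges only; a real pipeline would obtain these from a vision model.
EDGES = [
    Edge("grasper", "touches", "liver", 5, 8),
    Edge("grasper", "retracts", "gallbladder", 10, 25),
    Edge("gallbladder", "adjacent_to", "cystic duct", 10, 40),
    Edge("cystic duct", "clipped_by", "clip applier", 42, 50),
]


def build_graph(edges):
    """Adjacency list keyed by source entity, sorted so traversal order is deterministic."""
    graph = {}
    for e in sorted(edges, key=lambda e: (e.t_start, e.src, e.relation, e.dst)):
        graph.setdefault(e.src, []).append(e)
    return graph


def enumerate_paths(graph, start, hops):
    """Depth-first search for simple paths of exactly `hops` edges whose
    start times never decrease along the path."""
    results = []

    def walk(node, path, visited):
        if len(path) == hops:
            results.append(tuple(path))
            return
        for e in graph.get(node, []):
            if e.dst in visited:
                continue  # keep the path simple (no revisited entities)
            if path and e.t_start < path[-1].t_start:
                continue  # enforce temporal order between consecutive hops
            walk(e.dst, path + [e], visited | {e.dst})

    walk(start, [], {start})
    return results


def path_to_item(path):
    """Turn one path into a question, its answer, and a step-by-step provenance."""
    chain = " then ".join(f"'{e.relation}'" for e in path)
    provenance = [
        f"{e.src} {e.relation} {e.dst} during [{e.t_start:.0f}s, {e.t_end:.0f}s]"
        for e in path
    ]
    return {
        "question": f"Starting from the {path[0].src}, follow the relations "
                    f"{chain}. Which entity is reached?",
        "answer": path[-1].dst,
        "provenance": provenance,
        "hops": len(path),
    }
```

Calling `path_to_item(enumerate_paths(build_graph(EDGES), "grasper", 2)[0])` yields an item whose answer is the final node of the path and whose `provenance` list holds one entry per hop. Because edges are sorted before traversal, the same input edges produce the same paths regardless of the order in which they were extracted, which is the sense of "deterministic" assumed in this sketch.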