🤖 AI Summary
Current vision-language models struggle to interpret the involuntary, spatio-temporally dynamic pathological motor behaviors that characterize epileptic seizures. To address this, the study introduces the first multimodal benchmark dataset dedicated to seizure semiology, comprising 438 videos with over 35,000 dense annotations, and proposes a seven-task hierarchical evaluation framework that spans visual perception through diagnostic report generation. It also develops Seizure-RQI, a clinically interpretable metric for report quality, and runs baseline experiments on eleven open-weight multimodal large language models, alongside seizure-specific fine-tuning and a two-stage neuro-symbolic hybrid framework. The two-stage framework achieves an F1 score of 0.96 on epileptic versus non-epileptic classification, significantly outperforming end-to-end methods, while the baselines expose systematic deficiencies in existing models in lateralization inference and temporal symptom modeling.
📝 Abstract
While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in general video understanding, their capacity to interpret involuntary, spatio-temporally evolving pathological motor behaviors, such as seizure semiology, remains largely untested. To address this gap, we introduce Seizure-Semiology-Suite, a clinically grounded dataset and benchmark for fine-grained, structured seizure semiology understanding. The dataset includes 438 seizure videos annotated with over 35,000 dense labels covering 20 ILAE-defined semiological features. Building on this dataset, we propose a seven-task hierarchical benchmark that systematically evaluates MLLMs from low-level visual perception to temporal sequencing, narrative report generation, and seizure diagnosis. To enable clinically meaningful evaluation of generated reports, we further introduce the Report Quality Index for Seizure Semiology (Seizure-RQI). Extensive baselines across 11 open-weight MLLMs reveal systematic weaknesses in laterality reasoning, temporal localization, symptom sequencing, and clinically faithful reporting. We show that seizure-specific fine-tuning substantially improves performance across tasks, and that a two-stage neuro-symbolic framework achieves an F1 score of 0.96 on epileptic versus non-epileptic seizure classification. Seizure-Semiology-Suite establishes a rigorous benchmark for evaluating multimodal models in safety-critical medical video understanding and guides the development of clinically reliable, domain-adaptive multimodal intelligence.
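For readers less familiar with the headline metric, the following is a minimal sketch of how a binary F1 score is computed from predicted and reference labels. It is an illustration only, not the authors' evaluation code: the abstract does not say whether the reported 0.96 is a binary, macro, or weighted F1, so the sketch assumes the binary form with "epileptic" as the positive class. The function name `binary_f1` and the toy labels are hypothetical.

```python
def binary_f1(y_true, y_pred, positive="epileptic"):
    """Binary F1: harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration (assumption: not data from the paper).
y_true = ["epileptic", "epileptic", "epileptic", "non-epileptic", "non-epileptic"]
y_pred = ["epileptic", "epileptic", "non-epileptic", "non-epileptic", "epileptic"]
score = binary_f1(y_true, y_pred)
```

Because F1 balances precision and recall, a high score on this task requires both few missed epileptic seizures and few non-epileptic events mislabeled as epileptic.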