🤖 AI Summary
This work addresses the challenge of scaling fine-grained, evidence-grounded annotation in multimodal emotion recognition, where dynamic and misaligned cross-modal cues complicate consistent labeling. To this end, the authors propose a traceable, event-centric multimodal emotion annotation toolkit that first aligns heterogeneous data through preprocessing, then visualizes the multimodal signals on an interactive shared timeline. The system integrates large language models (LLMs) with modality-specific prompt templates to draft structured emotion annotations for human verification. By coupling LLM drafting with traceable event packets, the approach supports cross-modal consistency checks and improves annotation efficiency and interpretability. A demonstration on VR-based multimodal emotion recordings illustrates the workflow, producing structured, evidence-grounded emotion labels.
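The summary refers to traceable event packets. As a rough illustration only, the sketch below shows how such a packet might bundle a time window, per-modality keyframes, and pointers back to the source recordings; every class, field, and path name here is hypothetical and not the toolkit's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SourcePointer:
    """Pointer back into an original recording, keeping each label traceable."""
    file_path: str   # e.g. "session_03/face_cam.mp4" (illustrative path)
    start_s: float   # window start on the shared timeline (seconds)
    end_s: float     # window end (seconds)


@dataclass
class EventPacket:
    """One candidate emotional event, packaged for LLM drafting and human review."""
    event_id: str
    time_window: Tuple[float, float]                                # (start_s, end_s) after alignment
    keyframes: Dict[str, List[str]] = field(default_factory=dict)   # modality -> keyframe paths
    pointers: List[SourcePointer] = field(default_factory=list)     # evidence provenance
    draft_label: str = ""                                           # LLM proposal, pending verification


# Hypothetical example of one packet handed to the drafting and review steps.
packet = EventPacket(
    event_id="evt_0042",
    time_window=(128.4, 133.0),
    keyframes={"face_video": ["evt_0042/frame_01.jpg"], "vr_pose": ["evt_0042/pose_01.png"]},
    pointers=[SourcePointer("session_03/face_cam.mp4", 128.4, 133.0)],
)
print(packet.event_id, packet.time_window)
```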
📝 Abstract
Multimodal Emotion Recognition (MER) increasingly depends on fine-grained, evidence-grounded annotations, yet inspection and label construction are hard to scale when cues are dynamic and misaligned across modalities. We present an LLM-assisted toolkit that supports multimodal emotion data annotation through an inspectable, event-centered workflow. The toolkit preprocesses and aligns heterogeneous recordings, visualizes all modalities on an interactive shared timeline, and renders structured signals as video tracks for cross-modal consistency checks. It then detects candidate events and packages synchronized keyframes and time windows as event packets with traceable pointers to the source data. Finally, the toolkit integrates an LLM with modality-specific tools and prompt templates to draft structured annotations for analyst verification and editing. We demonstrate the workflow on multimodal VR emotion recordings with representative examples.
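The final step described in the abstract, drafting structured annotations from modality-specific prompt templates for analyst review, could look roughly like the sketch below. The template text, the `call_llm` placeholder, and the output fields are assumptions made for illustration; the paper's actual templates, LLM backend, and annotation schema are not reproduced here.

```python
import json

# Illustrative modality-specific prompt template (not the toolkit's real template).
FACE_VIDEO_TEMPLATE = (
    "You are annotating a {start_s:.1f}-{end_s:.1f}s segment of a VR session.\n"
    "Facial keyframes: {keyframes}\n"
    "Return JSON with fields: emotion, intensity (1-5), evidence (one sentence)."
)


def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM backend the toolkit uses; returns a canned draft here."""
    return json.dumps({
        "emotion": "surprise",
        "intensity": 3,
        "evidence": "raised brows and a sudden head turn toward the stimulus",
    })


def draft_annotation(event_id: str, start_s: float, end_s: float, keyframes: list) -> dict:
    """Fill the template for one event packet and parse the structured draft."""
    prompt = FACE_VIDEO_TEMPLATE.format(start_s=start_s, end_s=end_s, keyframes=keyframes)
    draft = json.loads(call_llm(prompt))
    draft["event_id"] = event_id     # keep the trace from the label back to its evidence
    return draft                     # handed to the analyst for verification and editing


# Hypothetical usage on one event packet.
print(draft_annotation("evt_0042", 128.4, 133.0, ["evt_0042/frame_01.jpg"]))
```

Keeping the `event_id` on the draft is what would let a reviewer jump from a proposed label back to the synchronized keyframes and source windows during verification.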