🤖 AI Summary
Deploying emotion recognition on resource-constrained edge devices imposes stringent requirements for model compactness, ultra-low power consumption, and on-device privacy preservation; existing cloud-dependent or multimodal approaches fail to meet the resulting real-time inference demands and hardware constraints. This paper proposes an efficient microcontroller-optimized multimodal emotion recognition framework featuring audio–text late fusion. It integrates a hardware-aware quantized Transformer acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling lightweight, task-specific fusion. Training-to-deployment spectral alignment is ensured via MicroFrontend preprocessing and the MLTK toolchain. Evaluated on the Coral Dev Board Micro, the system achieves end-to-end latency of 21–23 ms and a memory footprint of only 1.8 MB, while improving macro-F1 by 6.3% over unimodal baselines. To our knowledge, this is the first work to enable real-time, privacy-preserving multimodal emotion inference on ultra-low-power edge hardware.
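The late-fusion idea above can be sketched in a few lines: embeddings from the two branches are concatenated and passed through a small task-specific head, while the keyword embeddings stay frozen. The dimensions, class count, and `late_fusion_logits` helper below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): an acoustic embedding from the
# quantized Transformer and a frozen keyword embedding from DSResNet-SE.
ACOUSTIC_DIM, KEYWORD_DIM, NUM_EMOTIONS = 64, 32, 4

def late_fusion_logits(acoustic_emb, keyword_emb, w, b):
    """Concatenate both modality embeddings and apply a linear fusion head."""
    fused = np.concatenate([acoustic_emb, keyword_emb])
    return fused @ w + b

# Frozen keyword embeddings act as fixed features: no gradient updates reach
# the DSResNet-SE branch during fusion training.
acoustic_emb = rng.standard_normal(ACOUSTIC_DIM).astype(np.float32)
keyword_emb = rng.standard_normal(KEYWORD_DIM).astype(np.float32)

# Only this small fusion head is task-specific and trainable.
w = rng.standard_normal((ACOUSTIC_DIM + KEYWORD_DIM, NUM_EMOTIONS)).astype(np.float32)
b = np.zeros(NUM_EMOTIONS, dtype=np.float32)

logits = late_fusion_logits(acoustic_emb, keyword_emb, w, b)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # one probability per emotion class
```

Keeping the fusion head this small is what makes the extra linguistic modality affordable within a microcontroller-class memory budget.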
📝 Abstract
Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8 MB memory budget and 21–23 ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
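The reported gain is in macro F1, which averages per-class F1 with equal weight so that minority emotion classes count as much as frequent ones. A minimal reference implementation, using toy four-class labels rather than the paper's IEMOCAP data:

```python
def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy four-class example (angry, happy, neutral, sad) -- illustrative only.
y_true = ["ang", "hap", "neu", "sad", "ang", "hap", "neu", "sad"]
y_pred = ["ang", "hap", "neu", "neu", "ang", "sad", "neu", "sad"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.742
```

Because each class contributes equally to the mean, macro F1 penalises a model that ignores rare emotions, which is why it is a stricter metric than plain accuracy on imbalanced corpora such as IEMOCAP.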