Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying emotion recognition on resource-constrained edge devices imposes stringent requirements on model compactness, ultra-low power consumption, and on-device privacy preservation; existing cloud-dependent or multimodal approaches fail to meet these real-time inference demands and hardware constraints. This paper proposes an efficient, microcontroller-optimised multimodal emotion recognition framework built on audio–text late fusion. It integrates a hardware-aware quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network for lightweight, task-specific fusion. Spectral alignment between training and deployment is ensured via MicroFrontend preprocessing and the MLTK toolchain. Evaluated on the Coral Dev Board Micro, the system achieves end-to-end latency of 21–23 ms and a memory footprint of only 1.8 MB, while improving macro-F1 by 6.3% over unimodal baselines. To our knowledge, this is the first work to enable real-time, privacy-preserving multimodal emotion inference on ultra-low-power edge hardware.

📝 Abstract
Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8 MB memory budget and 21–23 ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro-F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
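The training–deployment spectrogram alignment described above hinges on both pipelines framing and featurising audio identically. A minimal, hypothetical sketch in plain Python of a per-frame log-energy feature extractor (the function name, frame length, and hop size are illustrative assumptions, not the paper's MicroFrontend implementation):

```python
import math

def log_energy_frames(samples, frame_len=400, hop=160):
    """Per-frame log-energy features: frame the waveform with the SAME
    window length and hop used at training time, then take log energy.
    Hypothetical stand-in for a MicroFrontend-style filterbank pipeline;
    any parameter mismatch between train and deploy shifts every feature."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len   # mean power
        feats.append(math.log(energy + 1e-6))            # floor avoids log(0)
    return feats

# 100 ms of a 440 Hz tone at 16 kHz: 25 ms frames (400 samples), 10 ms hop
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
feats = log_energy_frames(tone)
```

Running the training-side and device-side extractors over the same clip and comparing such feature vectors frame-by-frame is one simple way to verify parity before deployment.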
Problem

Research questions and friction points this paper is trying to address.

Deploying emotion recognition on ultra-low-power edge devices
Combining acoustic and linguistic features for multimodal inference
Achieving real-time performance within strict memory constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Late-fusion architecture optimised for Edge TPU
Quantised transformer with frozen keyword embeddings
Spectrogram alignment using MicroFrontend and MLTK
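The late-fusion idea in the bullets above can be sketched in a few lines: each modality produces its own feature vector, and only a small fusion head is trained on their concatenation. This is a minimal, hypothetical illustration in plain Python; the dimensions, weights, and function names are assumptions for exposition, not the paper's actual architecture:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(acoustic_logits, keyword_embedding, weights, bias):
    """Concatenate unimodal feature vectors, then apply a small linear
    fusion head followed by softmax over emotion classes. Only the head
    is trained; the keyword embeddings stay frozen."""
    fused = list(acoustic_logits) + list(keyword_embedding)
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)

# Toy sizes: 2 acoustic logits + 3-dim frozen keyword embedding -> 2 classes
acoustic = [0.5, -1.0]                    # from the quantised transformer
keywords = [0.1, 0.2, -0.3]               # frozen DSResNet-SE embedding
weights = [[1, 0, 0.5, 0.5, 0.5],         # trainable fusion-head weights
           [0, 1, -0.5, -0.5, -0.5]]
bias = [0.0, 0.0]
probs = late_fusion(acoustic, keywords, weights, bias)
```

Keeping the fusion head this small is what allows the combined model to fit the memory and latency budget reported for the Coral Dev Board Micro.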
Stavros Mitsis
Department of Computing, Imperial College London, United Kingdom
Ermos Hadjikyriakos
Department of Computing, Imperial College London, United Kingdom
Humaid Ibrahim
Department of Computing, Imperial College London, United Kingdom
Savvas Neofytou
Department of Computing, Imperial College London, United Kingdom
Shashwat Raman
Department of Computing, Imperial College London, United Kingdom
James Myles
Department of Computing, Imperial College London, United Kingdom
Eiman Kanjo
Professor, Imperial College London
TinyML · Edge AI · Decentralised AI · Collaborative & Distributed AI · Pervasive Computing