🤖 AI Summary
Deploying emotion recognition on resource-constrained edge devices imposes stringent requirements for model compactness, ultra-low power consumption, and on-device privacy preservation; existing cloud-dependent or multimodal approaches fail to meet the resulting real-time inference demands and hardware constraints. This paper proposes an efficient microcontroller-optimized multimodal emotion recognition framework featuring audio–text late fusion. It integrates a hardware-aware quantized Transformer acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling lightweight, task-specific fusion. Training-to-deployment spectral alignment is ensured via MicroFrontend preprocessing and the MLTK toolchain. Evaluated on the Coral Dev Board Micro, the system achieves end-to-end latency of 21–23 ms and a memory footprint of only 1.8 MB, while improving macro-F1 by 6.3% over unimodal baselines. To our knowledge, this is the first work to enable real-time, privacy-preserving multimodal emotion inference on ultra-low-power edge hardware.
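The late-fusion idea above can be sketched in a few lines: embeddings from the two branches are concatenated and passed through a small task-specific head, while the keyword embeddings stay frozen. The dimensions, class count, and `late_fusion_logits` helper below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): an acoustic embedding from the
# quantized Transformer and a frozen keyword embedding from DSResNet-SE.
ACOUSTIC_DIM, KEYWORD_DIM, NUM_EMOTIONS = 64, 32, 4

def late_fusion_logits(acoustic_emb, keyword_emb, w, b):
    """Concatenate both modality embeddings and apply a linear fusion head."""
    fused = np.concatenate([acoustic_emb, keyword_emb])
    return fused @ w + b

# Frozen keyword embeddings act as fixed features: no gradient updates reach
# the DSResNet-SE branch during fusion training.
acoustic_emb = rng.standard_normal(ACOUSTIC_DIM).astype(np.float32)
keyword_emb = rng.standard_normal(KEYWORD_DIM).astype(np.float32)

# Only this small fusion head is task-specific and trainable.
w = rng.standard_normal((ACOUSTIC_DIM + KEYWORD_DIM, NUM_EMOTIONS)).astype(np.float32)
b = np.zeros(NUM_EMOTIONS, dtype=np.float32)

logits = late_fusion_logits(acoustic_emb, keyword_emb, w, b)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # one probability per emotion class
```

Keeping the fusion head this small is what makes the extra linguistic modality affordable within a microcontroller-class memory budget.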
📝 Abstract
Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8 MB memory budget and 21–23 ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
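The reported gain is in macro F1, which averages per-class F1 with equal weight so that minority emotion classes count as much as frequent ones. A minimal reference implementation, using toy four-class labels rather than the paper's IEMOCAP data:

```python
def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy four-class example (angry, happy, neutral, sad) -- illustrative only.
y_true = ["ang", "hap", "neu", "sad", "ang", "hap", "neu", "sad"]
y_pred = ["ang", "hap", "neu", "neu", "ang", "sad", "neu", "sad"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.742
```

Because each class contributes equally to the mean, macro F1 penalises a model that ignores rare emotions, which is why it is a stricter metric than plain accuracy on imbalanced corpora such as IEMOCAP.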