🤖 AI Summary
Existing virtual assistant systems process voice-trigger detection (VT), device-directed speech detection (DDSD), and automatic speech recognition (ASR) in separate pipeline stages, leading to architectural redundancy and suboptimal cross-task generalization. To address this, the authors propose SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that jointly models all three tasks end-to-end within a single large language model (LLM) framework, taking both audio and text as inputs. SELMA relies on two key components: (1) low-rank adaptation modules for parameter-efficient training of both the audio encoder and the LLM, and (2) a feature pooling strategy that enables the model to capture global patterns and improves accuracy on tasks less reliant on individual sequence elements. Experiments demonstrate substantial improvements: a 64% relative reduction in equal error rate (EER) for VT detection, a 22% relative EER reduction for DDSD, and ASR word error rates close to those of dedicated baseline models. SELMA significantly simplifies the conventional multi-module input-processing pipeline while improving performance relative to dedicated single-task models.
📝 Abstract
In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs to a Large Language Model (LLM). SELMA is designed to handle three primary and two auxiliary tasks related to interactions with virtual assistants simultaneously within a single end-to-end model. We employ low-rank adaptation modules for parameter-efficient training of both the audio encoder and the LLM. Additionally, we implement a feature pooling strategy enabling the system to recognize global patterns and improve accuracy on tasks less reliant on individual sequence elements. Experimental results on Voice Trigger (VT) detection, Device-Directed Speech Detection (DDSD), and Automatic Speech Recognition (ASR) demonstrate that our approach significantly simplifies the typical input processing pipeline of virtual assistants while also improving performance compared to dedicated models for each individual task. SELMA yields relative Equal-Error Rate improvements of 64% on the VT detection task, and 22% on DDSD, while also achieving word error rates close to the baseline.
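The two components the abstract names — low-rank adaptation of frozen weights and pooling a feature sequence into a single global vector — can be sketched as follows. This is a minimal NumPy illustration, not SELMA's implementation: the dimensions, the LoRA scaling convention `alpha / r`, and the use of mean pooling are all assumptions for the sketch (the abstract does not specify the pooling operator or adapter placement).

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4  # toy sizes; real models use d >> r

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable low-rank down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-initialized

def lora_forward(x):
    # Low-rank adaptation: the frozen projection W @ x is augmented by a
    # trainable low-rank update (alpha / r) * B @ A @ x. Only A and B are
    # trained, so the number of trainable parameters is small.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapted layer reproduces the frozen
# layer exactly; training only perturbs the low-rank path.
assert np.allclose(lora_forward(x), W @ x)

# Feature pooling: collapse a variable-length sequence of per-frame audio
# features (T frames) into one global vector, so the model can base
# utterance-level decisions (e.g. VT, DDSD) on a summary of the whole input.
T = 16
features = rng.standard_normal((T, d_in))
pooled = features.mean(axis=0)  # shape (d_in,)
```

The zero initialization of `B` is the standard LoRA choice: it makes the adapted model identical to the base model before fine-tuning begins, so training starts from the pretrained behavior.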